How many DNA copies are there?
A copy number profile describes how many copies of each piece of DNA are present in a genome. A normal human genome has two copies of each of chromosomes 1-22, and either two copies of chromosome X (females) or one copy each of X and Y (males). A cancer genome may lack some pieces (one or both copies) and carry more than two copies of others. The result is a non-flat copy number profile (when copy number is plotted against genomic position), which is a very common characteristic of cancer cells.
Determining the break points (i.e. the genomic positions where the copy number jumps from one value to another) is called copy number segmentation. It may be posed as a statistical estimation problem, since the data obtained from molecular analyses of tumor cells are noisy (i.e. the observed copy number values at two genomic loci will generally differ even if the true copy numbers are identical). The level of observation noise can vary substantially with the technology used to obtain the measurements. In addition, most copy number analyses are based on pools of many cancer cells, and the individual cells may be genetically different; the resulting copy number jumps vary in amplitude, reflecting the proportion of cells that carry a particular change.
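The effect of noise and of mixed cell populations can be illustrated with a small simulation. The following is a minimal Python sketch under simple assumptions: the profile is piecewise constant, the noise is independent and Gaussian, and a change carried by only a fraction of the cells shows up as a fractional jump. The function names, segment lengths and noise level are hypothetical and chosen only for illustration.

```python
import random


def simulate_profile(segments, noise_sd, seed=0):
    """Simulate a piecewise-constant copy number profile with Gaussian noise.

    segments: list of (length, average_copy_number) pairs (toy input).
    Returns (true, observed), two lists of equal length.
    """
    rng = random.Random(seed)
    true = [cn for length, cn in segments for _ in range(length)]
    observed = [cn + rng.gauss(0.0, noise_sd) for cn in true]
    return true, observed


def breakpoints(true):
    """Positions where the true copy number changes from the previous locus."""
    return [i for i in range(1, len(true)) if true[i] != true[i - 1]]


# Toy profile (assumed values):
#   50 loci at the normal level of 2 copies,
#   30 loci with a one-copy gain present in 40% of the cells,
#      giving an average of 0.6 * 2 + 0.4 * 3 = 2.4 copies,
#   40 loci with a one-copy loss present in all cells.
segments = [(50, 2.0), (30, 2.4), (40, 1.0)]
true, observed = simulate_profile(segments, noise_sd=0.3)

print("true break points:", breakpoints(true))
print("first observed values:", [round(v, 2) for v in observed[:5]])
```

With a noise level comparable to the size of the smallest jump, the gain carried by a minority of the cells is hard to see by eye in the observed values, which is why segmentation is treated as an estimation problem.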
Our group has previously developed efficient algorithms for copy number segmentation, in collaboration with researchers in Cambridge, UK, and at Oslo University Hospital. These algorithms, which are implemented in R and available as the Bioconductor package copynumber, have been used in a series of published applications by our group and others. An important feature of these algorithms (and of other algorithms designed for the same purpose) is a smoothing parameter, i.e. a user-defined parameter that controls the sensitivity of the method. With this parameter, the user can tune the algorithms either to detect very small deviations in copy number or to detect only large ones. The software supplies a useful default value for this parameter, but in practice the user may have to adjust it to obtain optimal results.
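To make the role of a smoothing parameter concrete, here is a minimal Python sketch of one common way to pose segmentation: penalized least squares, where a fixed penalty `gamma` is paid for every segment. This formulation is an assumption made for illustration and is not taken from the copynumber package; the function name `segment`, the parameter name `gamma` and the toy data are hypothetical.

```python
def segment(y, gamma):
    """Return the break points of the piecewise-constant fit that minimizes
    (sum of squared errors) + gamma * (number of segments).

    A break point is the index at which a new segment starts.
    Uses a simple O(n^2) dynamic program; no attempt is made at speed.
    """
    n = len(y)
    # Prefix sums allow the squared error of any segment in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(y):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):
        """Squared error of y[i:j] around its own mean."""
        s = s1[j] - s1[i]
        return max(0.0, (s2[j] - s2[i]) - s * s / (j - i))

    best = [0.0] + [float("inf")] * n   # best[j]: optimal value for y[:j]
    prev = [0] * (n + 1)                # prev[j]: start of the last segment
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + cost(i, j) + gamma
            if c < best[j]:
                best[j] = c
                prev[j] = i

    # Walk back through the stored segment starts.
    found = []
    j = n
    while j > 0:
        i = prev[j]
        if i > 0:
            found.append(i)
        j = i
    return sorted(found)


# Toy data (assumed): a small jump of 0.2 at index 20, a large jump at 40.
y = [2.0] * 20 + [2.2] * 20 + [4.0] * 20
for gamma in (0.1, 1.0, 100.0):
    print(gamma, segment(y, gamma))
```

In this sketch a small `gamma` lets the small jump through, a moderate `gamma` keeps only the large jump, and a very large `gamma` suppresses all break points, which mirrors the sensitivity trade-off described above.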
The purpose of this master's thesis project is to investigate strategies for automatic selection of the smoothing parameter. The strategies involve optimizing a suitable criterion, and a discussion of the merits and disadvantages of the different criteria will be part of the project. In addition, the segmentation may have to be performed many times with different parameter settings to explore the landscape of possible models; this can be very time-consuming and is a major obstacle to the routine use of such approaches. Finding ways to speed up this process (e.g. through suitable pruning strategies) would substantially improve their applicability.
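As a rough illustration of criterion-based selection, the Python sketch below computes the best least-squares fit for every number of segments in a single pass and then picks the model that minimizes a criterion. Both choices are assumptions made for illustration, not the strategies the project will settle on: the BIC-like criterion is just one candidate, and sharing work across segment counts is just one possible way to avoid rerunning the segmentation for each parameter value. All function names are hypothetical.

```python
import math


def sse_by_segment_count(y, max_segments):
    """Return {k: minimal squared error of a fit with exactly k segments}."""
    n = len(y)
    s1, s2 = [0.0], [0.0]
    for v in y:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        """Squared error of y[i:j] around its own mean."""
        s = s1[j] - s1[i]
        return max(0.0, (s2[j] - s2[i]) - s * s / (j - i))

    inf = float("inf")
    # table[k][j]: minimal squared error for y[:j] using exactly k segments.
    table = [[inf] * (n + 1) for _ in range(max_segments + 1)]
    table[0][0] = 0.0
    for k in range(1, max_segments + 1):
        for j in range(k, n + 1):
            table[k][j] = min(table[k - 1][i] + cost(i, j)
                              for i in range(k - 1, j))
    return {k: table[k][n] for k in range(1, max_segments + 1)}


def bic_like(sse_k, k, n):
    """Assumed criterion: Gaussian fit term plus log(n) per free parameter
    (k segment means and k - 1 break points)."""
    return n * math.log(sse_k / n + 1e-12) + (2 * k - 1) * math.log(n)


def penalized(gamma):
    """Criterion equivalent to a fixed per-segment penalty gamma."""
    return lambda sse_k, k, n: sse_k + gamma * k


def select_segment_count(sse, n, criterion):
    """Number of segments that minimizes the given criterion."""
    return min(sse, key=lambda k: criterion(sse[k], k, n))


# Toy data (assumed): three levels plus a small deterministic perturbation.
y = [level + 0.05 * (-1) ** i
     for i, level in enumerate([2.0] * 20 + [2.5] * 20 + [4.0] * 20)]
sse = sse_by_segment_count(y, max_segments=6)
print("BIC-like choice:", select_segment_count(sse, len(y), bic_like))
for gamma in (1.0, 5.0):
    print("gamma", gamma, "->", select_segment_count(sse, len(y), penalized(gamma)))
```

Once the table of fits has been computed, evaluating another criterion or another penalty value costs only a scan over the candidate segment counts. The table itself still takes time quadratic in the number of loci, so this sketch does not remove the need for pruning on realistic data sizes.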
This project requires no prior knowledge of biology, but some knowledge of (or interest in) statistics or mathematics is beneficial.