Highly dynamic clustering
We have implemented a large open-source software system for statistical analysis of genome data (DNA). A range of different genome properties and different statistical analyses can be selected through a simple web interface, allowing biologists to perform advanced analysis with little effort. The system currently supports analyses of different forms of interdependencies between pairs of genome properties.
The task is to implement highly dynamic clustering functionality that lets users perform complex, precisely customized clustering of genome data through a simple interface. The idea is to develop a system in which a user can not only select which data sets (genome properties) to cluster, but also precisely define the notion of similarity used in the clustering. This can be achieved by building the clustering functionality on top of the existing functionality, using (customizable) pair-wise comparisons as the similarity measure. The distance between two genome properties (the inverse of their similarity) can either be based directly on a pair-wise comparison between the two properties, or it can be based on similarities in how the two properties relate to a third reference property. The end result should be a clustering system, integrated with our open-source genome analysis system, that allows biologists to perform highly customized and advanced clustering by simply selecting a few intuitive parameters in a web interface.
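As a rough illustration of the two distance notions described above, the following minimal Python sketch clusters a handful of toy genome properties using a pluggable pair-wise distance. Everything in it is an assumption made for illustration: the representation of a genome property as a list of per-bin values, the use of Pearson correlation as the pair-wise comparison, the average-linkage agglomerative procedure, and all function names. None of it reflects the existing system's API or its actual statistical analyses.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence

# Hypothetical representation: a genome property summarised as per-bin values.
Track = Sequence[float]
Distance = Callable[[Track, Track], float]


def pearson(a: Track, b: Track) -> float:
    """Pearson correlation, used here as a stand-in pair-wise comparison."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # treat a constant track as uncorrelated
    return cov / (var_a * var_b) ** 0.5


def direct_distance(a: Track, b: Track) -> float:
    """Distance based directly on comparing the two properties."""
    return 1.0 - pearson(a, b)


def reference_distance(a: Track, b: Track, reference: Track) -> float:
    """Distance based on how similarly a and b relate to a third property."""
    return abs(pearson(a, reference) - pearson(b, reference))


def distance_matrix(tracks: Dict[str, Track], dist: Distance) -> Dict[frozenset, float]:
    """Evaluate the chosen distance for every unordered pair of properties."""
    return {
        frozenset((p, q)): dist(tracks[p], tracks[q])
        for p, q in combinations(tracks, 2)
    }


def cluster(names: List[str], dmat: Dict[frozenset, float], k: int) -> List[List[str]]:
    """Average-linkage agglomerative clustering down to k clusters."""
    clusters = [[name] for name in names]

    def linkage(c1: List[str], c2: List[str]) -> float:
        total = sum(dmat[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the closest pair of clusters (i < j) and merge j into i.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data, invented for this example only.
tracks = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 4, 3, 2, 1],
    "D": [10, 8, 6, 4, 2],
}
reference = [1, 3, 2, 5, 4]

direct_clusters = cluster(list(tracks), distance_matrix(tracks, direct_distance), k=2)
reference_clusters = cluster(
    list(tracks),
    distance_matrix(tracks, lambda a, b: reference_distance(a, b, reference)),
    k=2,
)
```

The point of the sketch is that the clustering step only sees a distance function, so changing the notion of similarity (direct comparison, comparison via a reference property, or a different statistic altogether) would not require changing the clustering code itself.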
Students should have a background in machine learning or algorithms. No prior knowledge of biology is needed.