Developing a software tool to assess the similarity of simulated data and real-world data for evaluating the performance of machine learning methods.
Background and project description
Machine learning (ML) methods are often evaluated on simulated (synthetic) data. Ideally, the simulated data should mimic real-world data in its key characteristics and feature distributions. When simulated data is used to evaluate the performance of ML methods, it is therefore important to assess how similar the simulated data is to real-world data. There are two important considerations here: (a) exploring and understanding which characteristics of the real-world data the simulated data should reproduce, and (b) automating and streamlining the similarity assessment in a software tool, so that future users can assess the similarity of their own datasets with a few clicks or commands. In this project, a particular combination of simulated and real-world data from the biological domain will be used as a case study. The student will develop a software tool with an interface of their choice.
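As an illustration of what consideration (b) could look like in its simplest form, the sketch below compares each shared feature of a real and a simulated dataset using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions). This is only one possible similarity measure, chosen here as an assumption; the project itself does not prescribe a metric. The function names, the dict-of-lists data layout, and the feature names `length` and `abundance` are hypothetical and exist only for this example.

```python
from bisect import bisect_right


def ks_statistic(real, simulated):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    real_sorted = sorted(real)
    sim_sorted = sorted(simulated)
    n_real, n_sim = len(real_sorted), len(sim_sorted)
    if n_real == 0 or n_sim == 0:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every value that occurs in either sample.
    for value in set(real_sorted) | set(sim_sorted):
        cdf_real = bisect_right(real_sorted, value) / n_real
        cdf_sim = bisect_right(sim_sorted, value) / n_sim
        largest_gap = max(largest_gap, abs(cdf_real - cdf_sim))
    return largest_gap


def compare_features(real_data, simulated_data):
    """Return {feature_name: KS statistic} for features present in both datasets.

    Each dataset is a dict mapping a feature name to a list of numeric
    values (a hypothetical layout assumed for this sketch)."""
    shared = sorted(set(real_data) & set(simulated_data))
    return {name: ks_statistic(real_data[name], simulated_data[name])
            for name in shared}


if __name__ == "__main__":
    # Toy values, made up purely to exercise the functions.
    real = {"length": [1.0, 2.0, 3.0, 4.0], "abundance": [0.1, 0.2, 0.3, 0.4]}
    simulated = {"length": [1.0, 2.0, 3.0, 4.0], "abundance": [10.0, 11.0, 12.0, 13.0]}
    for feature, distance in compare_features(real, simulated).items():
        print(f"{feature}: KS distance = {distance:.2f}")
```

A real tool would likely go further, for example by adding significance tests, multivariate comparisons, or domain-specific characteristics identified under consideration (a), but a per-feature report like this one is a plausible starting point for a command-line interface.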
How will this task be useful in future jobs
Assessing how well simulated data resembles real-world data is an important consideration whenever an ML or statistical method is evaluated on simulated data. The way of thinking and the resources developed through this task are therefore transferable skills for future jobs.
- Study programs: Data Science/Computational Science/Statistics/Informatics/Bioinformatics
- Skills: A decent grasp of statistics and good programming skills in Python or R are assumed. No biology knowledge is required.