Developing machine learning methodology/software to better understand how the size of available training data affects prediction performance in real-world settings

Background and project description

The performance of machine learning (ML) methods depends on the size of the dataset available for training: performance typically improves as the sample size increases and saturates beyond a certain point. When an ML method is trained and applied in a real-world setting, it is often difficult to explain to what extent the observed performance is a consequence of the training dataset size. This is a particular challenge in domains where large datasets are uncommon because data are difficult to collect or generate. It would therefore be useful to develop a methodology and/or software that estimates the sample size required to attain a desired level of prediction performance, given a particular feature representation of the dataset and a chosen ML method. In this project, one combination of feature representation and ML method suitable for a dataset from the biology domain will be used as a case study.
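To make the idea of estimating a required sample size concrete, the following is a minimal sketch of one common approach: fitting a learning curve to pilot results and extrapolating. It assumes that the prediction error follows an inverse power law, error(n) = a * n^(-b) + floor, and that the floor (the error at saturation) is supplied by the user. Both are modelling assumptions, not something prescribed by this project description, and the function names and the pilot numbers are hypothetical; the numbers are synthetic and serve only to illustrate the calculation.

```python
import math


def fit_power_law(sizes, errors, floor=0.0):
    """Fit error(n) ~ a * n**(-b) + floor by least squares in log-log space.

    `floor` is an assumed saturation error supplied by the caller.
    Returns the tuple (a, b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e - floor) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # log(error - floor) = log(a) - b * log(n)
    return math.exp(intercept), -slope


def required_sample_size(a, b, target_error, floor=0.0):
    """Invert the fitted curve to find n such that error(n) <= target_error."""
    if target_error <= floor:
        raise ValueError("target error must exceed the assumed floor")
    return math.ceil((a / (target_error - floor)) ** (1.0 / b))


# Synthetic pilot measurements (hypothetical, generated from a known curve).
sizes = [100, 200, 400, 800, 1600]
errors = [0.05 + 2.0 * n ** -0.5 for n in sizes]

a, b = fit_power_law(sizes, errors, floor=0.05)
n_needed = required_sample_size(a, b, target_error=0.07, floor=0.05)
print(f"a={a:.3f}, b={b:.3f}, estimated training size: {n_needed}")
```

In practice the errors would come from repeated cross-validated runs on subsamples of the real dataset, the floor would itself have to be estimated, and the extrapolation should be reported with uncertainty; whether a power law is adequate for a given feature representation and ML method is part of what the project would investigate.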

How this task will be useful in future jobs

This task develops transferable skills, both in analytical thinking and in practical implementation, that are useful when applying ML methods in other domains.

Required background

  • Study programs: Data Science/Computational Science/Statistics/Informatics/Bioinformatics
  • Skills: A good grasp of statistics and machine learning is assumed, as are decent programming skills in Python or R. No biology knowledge is required.
Published Sep. 24, 2021 13:03 - Last modified Sep. 24, 2021 13:03

Scope (credits)