Developing machine learning methodology/software to better understand how the size of available training data affects prediction performance in real-world settings
Background and project description
The performance of machine learning (ML) methods depends on the size of the dataset available for training: performance typically improves as the sample size grows and saturates beyond a certain point. When an ML method is trained and applied in a real-world setting, it is often hard to tell to what extent the observed performance is a consequence of the training dataset size. This is a particular challenge in domains where large datasets are uncommon because data collection or generation is difficult. It would therefore be useful to develop a methodology and/or software for estimating the sample size required to attain a desired level of prediction performance, given a particular feature representation of the dataset and the ML method of choice. In this project, a combination of feature representation and ML method suited to a dataset from the biology domain will serve as the case study.
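One common way to approach such sample-size estimation, sketched below purely for illustration (the project is free to develop its own methodology), is to measure validation error at several training-set sizes, fit an inverse power-law learning curve, and invert the fitted curve to predict the size needed for a target error. All numbers, the target error, and the power-law form here are hypothetical assumptions, not results from any real dataset.

```python
# Illustrative sketch (assumed approach, not the project's final method):
# fit a learning curve error(n) ~ a * n**(-b) + c to empirical validation
# errors at several training sizes, then invert it to estimate the sample
# size needed for a target error. All data values below are made up.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # Classic learning-curve form: error decays as a * n^(-b)
    # toward an irreducible floor c.
    return a * np.power(n, -b) + c

# Hypothetical measured validation errors at increasing training sizes
# (in practice: retrain the chosen ML method on random subsamples).
sizes = np.array([50, 100, 200, 400, 800, 1600], dtype=float)
errors = np.array([0.42, 0.33, 0.27, 0.23, 0.21, 0.20])

# Fit with simple bounds to keep parameters in a sensible range.
params, _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.5, 0.1),
                      bounds=([0, 0, 0], [np.inf, 2, 1]), maxfev=10000)
a, b, c = params

def required_size(target_error):
    # Invert error = a * n^(-b) + c; only meaningful for target_error > c,
    # i.e. targets above the estimated irreducible error floor.
    return (a / (target_error - c)) ** (1.0 / b)

# Example: estimated sample size for a (hypothetical) target error of 0.22.
n_needed = required_size(0.22)
print(f"fitted decay exponent b = {b:.2f}, estimated n = {n_needed:.0f}")
```

Extrapolating such a curve far beyond the largest measured size is unreliable, so in practice the estimate should be accompanied by uncertainty quantification (e.g. refitting on bootstrap resamples of the measured points).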
How will this task be useful in future jobs
This task hones transferable skills, both in the thought process and in practical techniques, that are useful when applying ML methods across different domains in the future.
- Study programs: Data Science/Computational Science/Statistics/Informatics/Bioinformatics
- Skills: A good grasp of statistics/machine learning is assumed, as are decent programming skills in Python or R. No biology knowledge is required.