Using Multiple Instance Learning to find patterns in Immune Receptor Sequences

In a typical supervised machine learning setting, each sample in the training data has a label that tells us (and the machine) which class the sample belongs to. This allows modern machine learning methods to learn even very complex patterns that separate samples of different classes in a broad variety of settings. But what happens if we don’t know the class of each individual sample, but only a summary of the classes for groups of samples? For example, we might only have labels for groups of ten samples, each telling us whether at least one sample in the group is from the positive class.
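The group labeling in this example can be made concrete with a small simulation. The sketch below is illustrative only: the function name `make_bags`, the single Gaussian feature per sample, and the 10% positive rate are assumptions chosen for the demonstration, not part of the project description.

```python
import random

def make_bags(n_bags, bag_size=10, positive_rate=0.1, seed=0):
    """Simulate weakly labeled data: only the group (bag) label is meant to be observed."""
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        # Hidden per-sample labels (1 = positive class); a learner would not see these.
        instance_labels = [1 if rng.random() < positive_rate else 0
                           for _ in range(bag_size)]
        # One toy feature per sample: positives are shifted relative to negatives.
        features = [rng.gauss(2.0 if y else 0.0, 1.0) for y in instance_labels]
        # Group label: positive if at least one sample in the group is positive.
        bag_label = int(any(instance_labels))
        bags.append((features, bag_label, instance_labels))
    return bags

bags = make_bags(n_bags=1000)
frac_positive = sum(label for _, label, _ in bags) / len(bags)
# Probability that a group of ten holds at least one positive sample.
expected = 1 - (1 - 0.1) ** 10
print(f"observed positive groups: {frac_positive:.3f}, expected: {expected:.3f}")
```

Even with only 10% positive samples, roughly two thirds of the groups end up labeled positive, so a positive group label says little about any individual sample in it.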
Learning to separate positive from negative samples using such weakly labeled datasets is difficult; this problem is known as Multiple Instance Learning (MIL). Our goal is to use MIL to predict whether a patient has a disease based on hundreds of thousands of immune receptor sequences gathered from the patient, where only a few of the receptors are relevant to the disease label. To do this, we need to find out how different MIL methods perform for varying group sizes, numbers of samples, model complexities, causal structures, and other problem parameters.
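A minimal sketch of the kind of simulation study this implies is given below. It is a toy under stated assumptions, not a method from the project: each receptor is reduced to a single Gaussian score, each positive group contains exactly one relevant instance shifted by 3 standard deviations, and the MIL method is a simple max-pooling baseline with a fitted threshold. All function names and parameter values are hypothetical.

```python
import random

def sample_bag_score(rng, bag_size, positive, n_relevant=1, shift=3.0):
    """Draw one group of instance scores and pool it with max (a simple MIL baseline)."""
    scores = [rng.gauss(0.0, 1.0) for _ in range(bag_size)]
    if positive:
        # Only a few instances carry signal; the rest look like negatives.
        for i in range(n_relevant):
            scores[i] += shift
    return max(scores)

def make_dataset(rng, n_bags, bag_size):
    """Return (pooled_score, label) pairs with balanced positive and negative groups."""
    data = []
    for i in range(n_bags):
        positive = i % 2 == 0
        data.append((sample_bag_score(rng, bag_size, positive), int(positive)))
    return data

def accuracy(dataset, threshold):
    """Fraction of groups whose thresholded pooled score matches the group label."""
    hits = sum(int(score > threshold) == label for score, label in dataset)
    return hits / len(dataset)

def fit_threshold(dataset):
    """Pick the training-set pooled score that works best as a decision threshold."""
    candidates = sorted(score for score, _ in dataset)
    return max(candidates, key=lambda t: accuracy(dataset, t))

def run_sweep(bag_sizes=(10, 100, 1000), n_bags=300, seed=0):
    """Estimate held-out accuracy of the baseline for each group size."""
    rng = random.Random(seed)
    results = {}
    for bag_size in bag_sizes:
        train = make_dataset(rng, n_bags, bag_size)
        test = make_dataset(rng, n_bags, bag_size)
        results[bag_size] = accuracy(test, fit_threshold(train))
    return results

results = run_sweep()
for bag_size, acc in results.items():
    print(f"group size {bag_size:>5}: test accuracy {acc:.2f}")
```

In this toy setup the baseline degrades as groups grow, because the largest score among many irrelevant instances starts to rival the single relevant one. Sweeping the other problem parameters (number of groups, signal strength, number of relevant instances) follows the same pattern.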
The goal of this master's thesis project is to use simulated data and/or mathematical derivations to answer this question, and to apply the best-suited methods to our large data sets of immune receptor sequences.
Keywords: Machine Learning
Published Oct. 5, 2021 20:09 - Last modified Oct. 5, 2021 20:10

Scope (credits)