Quality scoring of alignments for movie & TV subtitles
Movie and TV subtitles (such as OpenSubtitles in OPUS) are a highly valuable resource for the compilation of parallel corpora thanks to their availability in large numbers and across many languages. However, the quality of the resulting sentence alignments is often much lower than for other parallel corpora.
It would be very useful to be able to associate quality scores with the sentence pairs of these corpora, so that we could filter out low-quality pairs. A first (not entirely successful) attempt at this was presented at LREC 2018. The master's thesis will further develop such quality scoring models and evaluate their impact on downstream tasks such as machine translation and conversation modelling.
Prerequisites: Some competence in statistical modelling and good programming skills. We will work with very large datasets, so basic knowledge of high-performance computing is an advantage.