Solving SuperGLUE in a smart way


Over the last few years, deep learning NLP architectures have obtained rapidly increasing scores on language understanding benchmarks like SuperGLUE. SuperGLUE is a benchmark for evaluating general-purpose language understanding systems. Its eight tasks emphasize diverse task formats: question answering, entailment detection, coreference resolution, semantic similarity of words in context, etc. Overall, it is a rich and challenging testbed for developing new general-purpose machine learning methods for language understanding in English.

Fair comparison

It is assumed that models which are successful on such a benchmark should be able to generalize to other language-related tasks, much like humans do. Currently, the SuperGLUE leaderboard is dominated by models based on the Transformer architecture and its variants. However, these models are often trained on different corpora, differing both in size and in genre. This makes it difficult to assess why model X outperforms model Y: is it because of a better architecture or because of a larger training corpus? Thus, there is a strong line of critique of this way of comparing models: we should at least standardize the training corpus across all models. This is the primary aim of this Master's thesis.

In particular, it is interesting to evaluate the real difference in performance between models based on Transformers (like BERT or RoBERTa) and models based on deep recurrent networks (like ELMo). For a fair comparison, we are going to train our own models on the same training corpus, making their results directly comparable. We will take into account not only the scores, but also the computational requirements for training and inference (if model X outperforms model Y by 0.002% while being 10x slower to train and use, can we really say that X is superior?). It is also important to analyze the detailed breakdown of performance by task and linguistic phenomenon: we would like to know which architecture is better at which task, and why.

Using heuristics instead of machine learning

In addition, it has long been noticed that deep neural models sometimes solve benchmarks like SuperGLUE by relying on statistical irregularities in the datasets, instead of really learning to understand language. Sometimes it is so pronounced that the task of finding a logical contradiction between two sentences can be solved by simply looking for a specific word in one of the sentences (for example, a negation marker like "no").
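To make the negation example concrete, here is a minimal sketch of such a shortcut "classifier". The marker list, the function names and the two-label setup are assumptions made for illustration only; they are not taken from SuperGLUE or from any published baseline.

```python
import re

# Hypothetical list of negation markers; a real study would derive it from the data.
NEGATION_MARKERS = {"no", "not", "never", "nobody", "nothing", "none"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes are kept)."""
    return re.findall(r"[a-z']+", text.lower())


def predict_label(premise, hypothesis):
    """Guess 'contradiction' if the hypothesis contains a negation marker.

    The premise is deliberately ignored: the point of the sketch is that
    a model can score above chance without comparing the two sentences.
    """
    tokens = tokenize(hypothesis)
    has_negation = any(t in NEGATION_MARKERS or t.endswith("n't") for t in tokens)
    return "contradiction" if has_negation else "entailment"


if __name__ == "__main__":
    print(predict_label("A man is sleeping.", "Nobody is sleeping."))
    print(predict_label("A dog runs in the park.", "An animal is moving."))
```

How well such a rule performs on a given dataset is an empirical question; comparing its accuracy with the majority-class baseline is one way to measure how strong the artifact is.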

Therefore, an interesting question arises: to what degree can SuperGLUE be solved by exploiting simple heuristics? This second part of the thesis involves only linguistic analysis of language data and experiments with hand-crafted heuristics, without any machine learning at all. Can we choose one alternative sentence out of two, given a premise sentence, by simply looking at their lexical overlap? Can we answer yes/no questions by simply comparing the syntactic subjects of the question and the passage text? And so on. It was recently shown that this is indeed the case for Russian SuperGLUE. Is it possible to do the same for the English benchmark? Let's find out!
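The lexical overlap idea for choosing between two alternatives could be sketched as follows. This is an illustrative assumption rather than a method described in the text: the stopword list, the Jaccard measure, the tie-breaking rule and all function names are hypothetical.

```python
import re

# Hypothetical, very small stopword list used only for this sketch.
STOPWORDS = {"the", "a", "an", "his", "her", "he", "she", "it",
             "in", "on", "of", "to", "is", "was"}


def content_tokens(text):
    """Return the set of lowercase word tokens that are not stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def choose_alternative(premise, choice1, choice2):
    """Return 0 or 1: the index of the choice sharing more words with the premise.

    Ties are resolved in favour of the first choice, an arbitrary decision.
    """
    p = content_tokens(premise)
    overlap1 = jaccard(p, content_tokens(choice1))
    overlap2 = jaccard(p, content_tokens(choice2))
    return 0 if overlap1 >= overlap2 else 1


if __name__ == "__main__":
    idx = choose_alternative(
        "The man broke his toe.",
        "He got a hole in his sock.",
        "He dropped a hammer on his toe.",
    )
    print(idx)
```

Whether surface overlap correlates with the correct answer at all is exactly what the thesis would have to test; the sketch only shows how little machinery such a heuristic needs.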

Prerequisites: programming skills (Python), linguistic curiosity.

Recommended reading: see the hyperlinks to the papers in the text above.

Keywords: natural language processing, computational linguistics, SuperGLUE, NLP benchmarks, deep learning
Published Oct. 13, 2021 17:18 - Last modified Oct. 13, 2021 17:18

