
Solving SuperGLUE in a smart way

SuperGLUE

Over the last few years, deep learning NLP architectures have obtained rapidly increasing scores on language understanding benchmarks like SuperGLUE or XTREME. SuperGLUE is a benchmark for evaluating general-purpose language understanding systems. The set of eight tasks in SuperGLUE emphasizes diverse task formats: boolean question answering, entailment detection, coreference resolution, word sense disambiguation in context, etc. Overall, it is a rich and challenging testbed for work developing new general-purpose machine learning methods for language understanding in English.
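
For a concrete look at what the benchmark contains, the tasks can be inspected programmatically. The sketch below is only an illustration: it assumes the Hugging Face `datasets` library is installed and uses that library's "super_glue" configuration names, which are not part of the benchmark itself (depending on the library version, loading may additionally require trust_remote_code=True).

```python
# Sketch: listing the SuperGLUE tasks via the Hugging Face `datasets` library
# (assumes `pip install datasets`; config names follow that library's conventions).
from datasets import load_dataset

TASKS = ["boolq", "cb", "copa", "multirc", "record", "rte", "wic", "wsc"]

for task in TASKS:
    ds = load_dataset("super_glue", task, split="validation")
    # Show the task name, the number of validation examples, and the example fields.
    print(f"{task}: {len(ds)} examples, fields: {ds.column_names}")
```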

Fair comparison

It is assumed that models which are successful on such a benchmark should be able to generalize to other language-related tasks, much like humans do. Currently, the SuperGLUE leaderboard is dominated by models trained with the Transformer architecture (and the like). However, these models are often trained on different corpora (differing both in size and in genre). This makes it difficult to assess why model X outperforms model Y: is it because of a better architecture or because of a larger training corpus? Thus, there is a strong line of criticism of this way of comparing models: we should at least standardize the training corpus across all of them. Researchers have tried to compare the performance of models trained on corpora of different sizes, but what about corpora of a different nature (domains)?

Benchmark lottery

Standardized benchmarks are collections of datasets that address various aspects of evaluating the generalization abilities of language models. However, these benchmarks are known to be fragile by nature: the overall model performance heavily depends on the choice of the benchmark tasks (Dehghani et al., 2021). At the same time, it is unclear how well the construction of these benchmarks aligns with research claims of human-level performance (Raji et al., 2021). How common are task and instance selection biases in such standardized NLP benchmarks for English and/or Norwegian, and what can we do about it?

Using heuristics instead of machine learning

In addition, it has long been observed that deep neural models sometimes solve benchmarks like SuperGLUE by relying on statistical irregularities in the datasets instead of really learning to understand language. Sometimes this is so pronounced that the task of finding a logical contradiction between two sentences can be solved by simply looking for a specific word in one of the sentences (for example, a negation marker like "no").
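
As a minimal illustration of such a shortcut (a hypothetical sketch, not a heuristic reported in any particular paper), an entailment/contradiction pair could be labelled as a contradiction whenever the hypothesis contains a surface negation marker, without looking at the premise at all:

```python
# Hypothetical shortcut for an NLI-style task: predict "contradiction" whenever
# the hypothesis contains a surface negation marker, ignoring the premise entirely.
NEGATION_MARKERS = {"no", "not", "never", "none", "nothing", "n't"}

def negation_heuristic(premise: str, hypothesis: str) -> str:
    # Split "isn't" -> "is n't" so the contracted negation becomes a separate token.
    tokens = hypothesis.lower().replace("n't", " n't").split()
    return "contradiction" if any(t in NEGATION_MARKERS for t in tokens) else "entailment"

print(negation_heuristic("The lights are on.", "There is no light in the room."))  # contradiction
```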

Therefore, an interesting question arises: to what degree can SuperGLUE be solved by exploiting simple heuristics? This part of the thesis involves only linguistic analysis of the language data and experimenting with hand-crafted heuristics, without any machine learning at all. Can we choose one alternative sentence out of two, given a premise sentence, by simply looking at their lexical overlap? Can we answer yes/no questions by simply comparing the syntactic subjects of the question and the passage text? And so on. It was recently shown that this is indeed the case for Russian SuperGLUE. Is it possible to do the same for the English benchmark? Let's find out!
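
A lexical-overlap heuristic of the kind hinted at above could, for a COPA-style choice between two alternatives, look roughly like the following hypothetical sketch (the function names and inputs are illustrative plain strings, not tied to any specific dataset loader):

```python
import string

# Hypothetical lexical-overlap heuristic for a COPA-style task: pick the
# alternative that shares more word types with the premise (ties go to the first).
def word_set(text: str) -> set:
    # Lowercase and strip punctuation before splitting into word types.
    return set(text.lower().translate(str.maketrans("", "", string.punctuation)).split())

def choose_by_overlap(premise: str, choice1: str, choice2: str) -> int:
    overlap1 = len(word_set(premise) & word_set(choice1))
    overlap2 = len(word_set(premise) & word_set(choice2))
    return 0 if overlap1 >= overlap2 else 1  # 0 = first alternative, 1 = second

print(choose_by_overlap(
    "The woman's ring slipped off in the shower.",
    "The woman polished the ring.",
    "The woman coated her hand with soap.",
))
```

How close such trivial baselines come to the scores of large pretrained models is exactly the kind of question this part of the thesis would investigate.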

Prerequisites: programming skills (Python), linguistic curiosity.

Recommended reading: see the hyperlinks to the papers in the text above.

Keywords: natural language processing, computational linguistics, SuperGLUE, NLP benchmarks, deep learning
Published Sep. 22, 2022 9:40 PM - Last modified May 28, 2024 7:03 PM

Supervisor(s)

Student(s)

Scope (credits)

60