LTG research seminar

NLP researchers from LTG and beyond present their findings in an informal setting, followed by questions and discussion.

The Language Technology Group (LTG) research seminar is a biweekly event. Its regular time slot is Monday, 12:15-13:15 CEST.

In the Spring of 2022, the seminar is conducted in a hybrid form: both offline in room 4118 of Ole-Johan Dahls hus, UiO, and online in Zoom (link available upon request). For questions or suggestions related to the LTG seminar, please contact Andrey Kutuzov.

Forthcoming talks

January 17, 2022

Targeted Sentiment Analysis (TSA), and how to make the best use of your data

Targeted Sentiment Analysis aims to detect, for each sentence, the words representing what is spoken about positively or negatively. In the sentence "I admire my dog", "my dog" is spoken about positively. TSA does not include finding the holder/source ("I") or the words that express the positivity or negativity ("admire").

I have performed TSA on the Norwegian NoReC-fine dataset, comparing different word embeddings used with an LSTM as well as several pretrained BERT-related models. NoReC-fine consists of newspaper reviews covering various domains. We examine the cross-domain effect on the results and compare them with cross-lingual experiments on same-domain data.
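
For readers unfamiliar with the task setup, TSA is commonly cast as BIO sequence labelling over tokens. Below is a minimal sketch of that formulation with a BERT-style encoder from the transformers library; the model name and label set are illustrative assumptions, not the exact configuration used in the talk.

    # Minimal sketch: TSA as BIO sequence labelling with a BERT-style encoder.
    # The model name and the label inventory are illustrative assumptions.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    LABELS = ["O", "B-targ-Positive", "I-targ-Positive",
              "B-targ-Negative", "I-targ-Negative"]

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=len(LABELS))

    inputs = tokenizer("I admire my dog", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, num_labels)
    predictions = logits.argmax(-1).squeeze().tolist()

    # Before fine-tuning on labelled data the predictions are random; after
    # training, "my dog" should come out as B-targ-Positive I-targ-Positive.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    for token, pred in zip(tokens, predictions):
        print(token, LABELS[pred])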

Past talks

The LTG research seminar has a long history, but this page covers only the talks from Fall 2021 onwards.

December 13, 2021

Multilingual Language Models for Fine-tuning and Feature Extraction in Word-in-Context Disambiguation

SemEval-2021 Task 2: Multilingual and Crosslingual Word-in-Context Disambiguation (MCL-WiC) is proposed as a benchmark to evaluate context-sensitive word representations. Our main interest is to investigate the usefulness of pre-trained multilingual language models (LMs) in this MCL-WiC task, without resorting to sense inventories, dictionaries, or other resources. As our main method, we fine-tune the language models with a span classification head. We also experiment with using the multilingual language models as feature extractors, extracting contextual embeddings for the target word. We compare three different LMs: XLM-RoBERTa (XLMR), multilingual BERT (mBERT) and multilingual distilled BERT (mDistilBERT).

We find that fine-tuning is better than feature extraction. XLMR performs better than mBERT in the cross-lingual setting both with fine-tuning and feature extraction, whereas these two models give a similar performance in the multilingual setting. mDistilBERT performs poorly with fine-tuning but gives similar results to the other models when used as a feature extractor.
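
As an illustration of the feature-extraction approach, the sketch below embeds the target word in each of two contexts with a multilingual LM and compares the two vectors by cosine similarity. The model choice, mean-pooling over subwords and the 0.5 threshold are assumptions for illustration, not the exact setup from the talk.

    # Minimal sketch of the feature-extraction route: embed the target word
    # in two contexts and compare the vectors. Pooling and threshold are
    # illustrative assumptions.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModel.from_pretrained("xlm-roberta-base")

    def target_embedding(sentence, start, end):
        """Mean-pool subword vectors overlapping the [start, end) char span."""
        enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
        offsets = enc.pop("offset_mapping")[0].tolist()
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
        keep = [s < end and e > start and s != e for s, e in offsets]
        return hidden[torch.tensor(keep)].mean(dim=0)

    v1 = target_embedding("He sat on the bank of the river.", 14, 18)
    v2 = target_embedding("She deposited money at the bank.", 27, 31)
    sim = torch.cosine_similarity(v1, v2, dim=0).item()
    print("same sense" if sim > 0.5 else "different sense", sim)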

November 29, 2021

Analyzing public sentiment towards wind energy in Norway
The work was done within this project

With technological development and decreasing costs, wind power has lately become profitable in Norway even without subsidies, which has led to an increase in new developments across the country. These new developments have not always been welcomed, and as development continued to increase, so did the opposition. Not only should the benefits and burdens of energy infrastructure and energy policies be distributed fairly and equitably; transitioning to a low-carbon society or energy system also requires public support for political decisions and policies.

A traditional way to collect information on public opinion is via questionnaires, surveys or interviews. These methods may, however, be prone to selection bias and response bias, as well as missing data and incomplete information. There is therefore value in exploring alternative methods of acquiring information on public opinion. In this study, we follow the work of Kim et al. [2020] and assess the public sentiment in Norway towards on- and offshore wind energy via a machine learning approach to natural language processing, based on data scraped from social media sites such as Twitter. We collected about 70,000 Norwegian tweets, manually annotated a subset of them, and used this subset to fine-tune NorBERT. We then used the fine-tuned model to classify the remaining tweets.
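
The classification step can be sketched roughly as below: fine-tune NorBERT on the manually annotated tweets, then apply the fine-tuned model to the unlabelled remainder. The hub id, label scheme, example tweets and hyperparameters are placeholders, not the project's actual configuration.

    # Rough sketch of the pipeline: fine-tune NorBERT on annotated tweets,
    # then classify the rest. All names and settings are placeholders.
    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    MODEL = "ltgoslo/norbert"   # assumed hub id for NorBERT
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

    class TweetDataset(torch.utils.data.Dataset):
        """Annotated tweets with labels 0=negative, 1=neutral, 2=positive."""
        def __init__(self, texts, labels):
            self.enc = tokenizer(texts, truncation=True, padding=True)
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    train_ds = TweetDataset(["Vindkraft er bra for klimaet!",
                             "Nei til vindturbiner i urørt natur."], [2, 0])
    trainer = Trainer(model=model,
                      args=TrainingArguments(output_dir="norbert-wind",
                                             num_train_epochs=3),
                      train_dataset=train_ds)
    trainer.train()   # afterwards: trainer.predict(...) on the unlabelled tweets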

November 8, 2021

Text anonymization with explicit measures of disclosure risk

We present a new approach to text anonymization, one that moves beyond named entity recognition and incorporates disclosure risk into the process, combining NLP and privacy-preserving data publishing (PPDP). Making use of Wikipedia biographies and background knowledge from Wikidata, we propose an automatic annotation method based on k-anonymity that can produce large amounts of labeled data for sensitive information. We train two BERT models on these data, following two different approaches to picking sensitive terms to mask. We also manually annotate and release a sample of 1,000 article summaries, and use it to evaluate the performance of our models.
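
To give an intuition for the k-anonymity criterion behind the automatic annotation, the toy sketch below flags a combination of terms as risky when it matches fewer than k individuals in the background knowledge. The knowledge base and attributes are purely illustrative; the actual method operates over Wikidata-scale background knowledge.

    # Toy illustration of the k-anonymity criterion: a combination of terms
    # is risky if it singles out fewer than k individuals. All data here is
    # invented for illustration.
    from itertools import combinations

    KB = {   # individual -> public attributes (in practice: from Wikidata)
        "person_a": {"norwegian", "linguist", "born_1980", "oslo"},
        "person_b": {"norwegian", "linguist", "born_1975", "bergen"},
        "person_c": {"norwegian", "physicist", "born_1980", "oslo"},
    }

    def risky_term_sets(terms, k=2, max_size=2):
        """Return term combinations matched by fewer than k individuals."""
        risky = []
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(terms), size):
                matches = sum(set(combo) <= attrs for attrs in KB.values())
                if matches < k:
                    risky.append(combo)
        return risky

    # Terms found in one biography: which combinations need masking?
    print(risky_term_sets({"norwegian", "linguist", "born_1980", "oslo"}))
    # -> [('born_1980', 'linguist'), ('linguist', 'oslo')]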

October 25, 2021

What Quantifying Word Order Freedom Reveals about Dependency Corpora
Maja Buljan (LTG)
Slides
 
This is an overview of ongoing work on word order freedom and syntactic annotation, with the goal of differentiating between findings that reveal inherent properties of languages and those that depend on annotation style. Following previous work on defining a quantifiable and linguistically interpretable measure of word order freedom in language, we take a closer look at the robustness of the basic measure (word order entropy) to variations in the dependency corpora used in the analysis. We compare measures at three levels of generality, applied to treebanks annotated according to the Universal Dependencies v1 and v2 guidelines, spanning 31 languages. Preliminary results show that certain measures, such as subject-object order freedom, are sensitive to changes in annotation guidelines, highlighting aspects of these metrics that should be taken into consideration when using dependency corpora for linguistic analysis.
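
To make the basic measure concrete, the sketch below estimates subject-object order entropy, H = -Σ p(o) log2 p(o) over the two orders {SO, OS}, from a treebank in CoNLL-U format. The CoNLL-U handling is simplified and the file path is a placeholder; the paper's measures at other levels of generality are analogous.

    # Sketch: subject-object order entropy from a CoNLL-U treebank.
    # Simplified parsing (multiword tokens skipped); path is a placeholder.
    import math
    from collections import Counter

    def so_order_entropy(conllu_path):
        counts = Counter()
        sent = []
        for line in open(conllu_path, encoding="utf-8"):
            line = line.strip()
            if line and not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():                  # skip ranges/empty nodes
                    sent.append((int(cols[0]), int(cols[6]), cols[7]))
            elif not line and sent:                    # sentence boundary
                for sid, shead, sdep in sent:
                    if sdep != "nsubj":
                        continue
                    for oid, ohead, odep in sent:
                        if odep == "obj" and ohead == shead:   # same head verb
                            counts["SO" if sid < oid else "OS"] += 1
                sent = []
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    # 0 bits = fully fixed subject/object order, 1 bit = completely free.
    print(so_order_entropy("no_bokmaal-ud-train.conllu"))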

October 11, 2021

"Improving Multilingual Lexical Normalization by Fine-tuning ByT5"
David Samuel (LTG)

Slides, paper

We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021, which evaluates lexical normalization systems on 12 social media datasets in 11 languages.

Our system is based on a pre-trained byte-level language model, ByT5, which we further pre-train on synthetic data and then fine-tune on authentic normalization data. It achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. We release both the source code and the fine-tuned models.
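
For context, the sketch below shows byte-level seq2seq generation with the publicly available base ByT5 checkpoint. Feeding the raw sentence as input is an illustrative assumption; the exact input encoding and the further pre-trained weights are in the released source code and models.

    # Sketch of byte-level seq2seq generation with the base ByT5 checkpoint.
    # The raw-sentence input format is an assumption for illustration.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
    model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

    # A noisy social-media sentence to be normalized.
    inputs = tokenizer("new pix comming tomoroe", return_tensors="pt")

    # The untuned base model will output near-garbage here; after further
    # pre-training on synthetic noise and fine-tuning on MultiLexNorm data,
    # the target output would be "new pics coming tomorrow".
    out = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))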

September 27, 2021

"Grammatical Profiling for Semantic Change Detection"
Andrey Kutuzov (LTG)
https://arxiv.org/abs/2109.10397

Slides

Semantics, morphology and syntax are strongly interdependent. However, the majority of computational methods for semantic change detection use distributional word representations which encode mostly semantics. We investigate an alternative method, grammatical profiling, based entirely on changes in the morphosyntactic behaviour of words.

We demonstrate that it can be used for semantic change detection and even outperforms some distributional semantic methods. We present an in-depth qualitative and quantitative analysis of the predictions made by our grammatical profiling system, showing that they are plausible and interpretable.
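
As a sketch of the underlying idea, one can represent a word in each time period by the distribution of its morphological feature values and score change by the divergence between the two distributions. The counts below are invented toy numbers, and the specific features and distance metric in the paper may differ.

    # Toy sketch: profile a word by the distribution of one morphological
    # feature in two periods; score change with Jensen-Shannon divergence.
    import math
    from collections import Counter

    def jsd(p, q):
        """Jensen-Shannon divergence (base 2) between two count profiles."""
        keys = set(p) | set(q)
        P = {k: p[k] / sum(p.values()) for k in keys}
        Q = {k: q[k] / sum(q.values()) for k in keys}
        M = {k: (P[k] + Q[k]) / 2 for k in keys}
        def kl(a, b):
            return sum(a[k] * math.log2(a[k] / b[k]) for k in keys if a[k] > 0)
        return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

    # Number marking of one noun drifts from mostly singular to mostly plural.
    period1 = Counter({"Number=Sing": 90, "Number=Plur": 10})
    period2 = Counter({"Number=Sing": 40, "Number=Plur": 60})
    print(jsd(period1, period2))   # larger divergence = stronger change signal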
