The Language Technology Group (LTG) research seminar is a biweekly event. The regular time slot for the seminar is Monday, 12:15-13:15 CEST.
The seminar is conducted in a hybrid form: both in person in room 4118 of Ole-Johan Dahls hus, UiO, and online on Zoom (link available on request). For questions or suggestions related to the LTG seminar, please contact David Samuel.
The LTG research seminar has a long history, but this page covers Fall 2021 onwards only.
December 5, 2022
A lot of data fits naturally into a graph formalism. Social networks, biological systems, and maps can all be thought of as relationships between objects. The Graph Neural Network (GNN) architecture has become the most expressive way to do representation learning over such structures, providing a unified framework for generating embeddings for nodes, relations, and even entire graphs. These embeddings can be used for tasks like predicting fraudulent transactions in a financial network, drug discovery, and shortest path computation for services like Google Maps. The first half of this talk will introduce the most common GNN models, focusing on how they differ from "standard" NNs. The second half will present and discuss how GNNs are currently being used to solve tasks within NLP, particularly for integrating external knowledge for tasks like QA and language modeling.
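As a minimal illustration of the message-passing idea behind GNNs, here is a toy graph-convolution layer in NumPy (a simplified sketch, not any specific model from the talk): each node averages its neighbours' features, including its own, and applies a shared weight matrix.

```python
import numpy as np

def gcn_layer(adj, feats, weights):
    """One simplified graph-convolution layer: mean aggregation over
    neighbours (plus self-loop), then a shared linear map and ReLU."""
    a_hat = adj + np.eye(adj.shape[0])        # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)    # node degrees
    agg = (a_hat @ feats) / deg               # average neighbour features
    return np.maximum(agg @ weights, 0.0)     # shared weights + ReLU

# toy graph: 3 nodes, node 0 connected to nodes 1 and 2
adj = np.array([[0., 1., 1.],
                [1., 0., 0.],
                [1., 0., 0.]])
feats = np.eye(3)                             # one-hot input features
rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 4))             # random projection to 4 dims
embeddings = gcn_layer(adj, feats, weights)   # (3, 4) node embeddings
```

Stacking several such layers lets information propagate over multi-hop neighbourhoods, which is where GNNs differ most from "standard" feed-forward NNs.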
November 7, 2022
Event extraction involves the detection and extraction of both event triggers and their corresponding arguments. Existing systems often decompose event extraction into multiple subtasks without considering their possible interactions. We propose EventGraph, a joint framework for event extraction which encodes events as graphs. We represent event triggers and arguments as nodes in a semantic graph. Event extraction therefore becomes a graph parsing problem, which provides the following advantages: 1) performing event detection and argument extraction jointly; 2) detecting and extracting multiple events from a piece of text; and 3) capturing the complicated interactions between event arguments and triggers. Experimental results on ACE2005 show that our model is competitive with state-of-the-art systems and substantially improves results on argument extraction.
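The graph encoding can be pictured with a small, hypothetical example (the sentence, node labels and role names below are illustrative, not taken from the paper): triggers and arguments become nodes, and labelled edges attach arguments to their trigger.

```python
# A hypothetical event graph for "Police arrested the suspect in Oslo":
# the trigger and its arguments are nodes; labelled edges connect each
# argument to the trigger it belongs to.
event_graph = {
    "nodes": {
        0: ("arrested", "Trigger:Justice.Arrest"),
        1: ("Police", "Argument"),
        2: ("the suspect", "Argument"),
        3: ("Oslo", "Argument"),
    },
    "edges": [(0, 1, "Agent"), (0, 2, "Person"), (0, 3, "Place")],
}

def arguments_of(graph, trigger_id):
    """All (text, role) argument pairs attached to a trigger node."""
    return [(graph["nodes"][dst][0], role)
            for src, dst, role in graph["edges"] if src == trigger_id]

print(arguments_of(event_graph, 0))
```

A graph parser predicts nodes and labelled edges jointly, so multiple events in one sentence simply become multiple trigger nodes in the same graph.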
October 24, 2022
This PhD project aims at developing adversarial models for text anonymization (i.e. models that re-identify masked identifiers). To do this, we plan to use retrieval-based transformer models and various databases that act as levels of background knowledge available to an attacker.
September 26, 2022
The completion of NARC (Norwegian Anaphora Resolution Corpus) is getting close. We present a corpus for coreference resolution and bridging for Norwegian. Our forthcoming publication focuses on the Bokmål part of the corpus, with basic corpus statistics and preliminary modelling results. We also present a set of PoS-tagged tweets, created in order to evaluate commonly used PoS-taggers on informal Norwegian data in three categories: Bokmål, Nynorsk and dialectal tweets. The dataset is small, but allows us to highlight some of the problems that arise when PoS-tagging informal text and written dialect.
June 13, 2022
Work in progress
We have recently created an exploratory dataset to find similarities and differences between Entity-Level Sentiment Analysis (ELSA) and other sentiment analysis tasks, in particular Targeted Sentiment Analysis (TSA). We see that ELSA can (partially) be derived from NER, coreference resolution and TSA, but error propagation is an issue. Our next steps are 1) to create annotation guidelines for a proper ELSA dataset, and 2) to explore alternative, possibly summarization-related approaches to ELSA. After the presentation we will have room for discussion and brainstorming.
May 16, 2022
Sardana Ivanova (University of Helsinki)
Paper 1, Paper 2
This presentation gives an overview of language technology tools for supporting low-resource languages, in particular the Sakha language. The tools include a morphological analyser, a computer-assisted language learning (CALL) platform, and two natural language generation (NLG) systems.
We extended an earlier, preliminary version of the morphological analyser, built on the Apertium rule-based machine translation platform. The transducer, developed using the Helsinki Finite-State Toolkit (HFST), has coverage solidly above 90% and high precision. Based on the morphological analyser, we implemented a language learning environment for Sakha in the Revita CALL platform. Revita is a freely available online platform for learners beyond the beginner level.
Currently we have implemented two NLG systems for Finnish and a few other languages: a transformer-based poetry generation system and a template-based news generation system. We plan to extend those systems to support Sakha.
May 9, 2022
Modern task-oriented spoken dialogue systems often rely on dialogue management modules, which keep track of information for an autonomous dialogue agent to complete tasks. Dialogue systems are also frequently deployed with physical robotic agents for Human-Robot Interaction (HRI).
This seminar will detail completed and planned work in dialogue management and HRI that comprises the titular doctoral project, as part of the third-semester evaluation of the work. The present state of the art and proposed graph-based approaches for dialogue management will be discussed, in addition to methodological challenges. We will present an overview of the current theory, investigations into methodologies and prototypes, data collection, and an outline of future work for the project. The discussion will also include a description of the work in the context of HRI and experiments with the project’s Pepper robot.
April 25, 2022
We describe NorDiaChange: the first diachronic semantic change dataset for Norwegian. NorDiaChange comprises two novel subsets, covering about 80 Norwegian nouns manually annotated with graded semantic change over time. Both datasets follow the same annotation procedure and can be used interchangeably as train and test splits for each other. NorDiaChange covers the time periods related to pre- and post-war events, oil and gas discovery in Norway, and technological developments. The annotation was done using the DURel framework and two large historical Norwegian corpora. NorDiaChange is published in full under a permissive license, complete with raw annotation data and inferred diachronic word usage graphs (DWUGs).
March 31, 2022 (12:30)
Natural language processing encompasses a spectrum of tasks whose goal, on a superficial level, is to structure the information contained in "raw" human language input. In this talk, we focus on meaning representation parsing -- i.e. mapping from natural language utterances to graph-based encodings of semantic structure. As the halfway-point progress report (third semester evaluation) of the titular doctoral project, we will give an overview of the methodological challenges of the task, review the current state-of-the-art, and summarise completed and ongoing work that comprises the project. This includes an in-depth dive into different meaning representation frameworks, parsing architectures, diagnostic evaluation of systems, and framework-specific error analysis, as well as a look forward to (currently) unsolved challenges in model development, e.g. multitask learning for cross-framework and cross-lingual parsing.
March 14, 2022
Speakers are thought to use efficient information transmission strategies for effective communication. For example, they transmit information at a constant rate in written text and they use repetitions extensively in spoken dialogue. We analyze these strategies in monologue and dialogue datasets, combining information-theoretic measures with probability estimates obtained from Transformer-based language models.
We find (i) that information density decreases overall in spoken open domain and written task-oriented dialogues, while it remains uniform in written texts; (ii) that speakers’ choices are oriented towards global, rather than local, uniformity of information; (iii) that uniform information density strategies are at play in dialogue when we zoom in on topically and referentially coherent contextual units; (iv) and that repetitions of non-topical and non-referential expressions, too, can be interpreted as an efficient production strategy.
Besides providing new empirical evidence on written and spoken language production, we believe that our studies can directly inform the development of more human-like natural language generation models.
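To make the information-theoretic measures concrete, here is a minimal sketch of per-token surprisal and a variance-based (non-)uniformity score. In the actual studies the probabilities come from Transformer-based language models; the values below are invented for illustration.

```python
import math

def surprisal(probs):
    """Per-token surprisal in bits: s(w_i) = -log2 p(w_i | context)."""
    return [-math.log2(p) for p in probs]

def density_variance(surprisals):
    """Variance of surprisal across a unit of text; a uniform information
    density strategy corresponds to low variance."""
    mean = sum(surprisals) / len(surprisals)
    return sum((s - mean) ** 2 for s in surprisals) / len(surprisals)

# hypothetical next-token probabilities from a language model
probs = [0.25, 0.5, 0.125, 0.25]
s = surprisal(probs)        # [2.0, 1.0, 3.0, 2.0] bits
print(density_variance(s))  # 0.5
```

Comparing this variance within local windows versus whole documents or dialogues is one way to operationalise the "global rather than local uniformity" question.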
February 28, 2022
Internet protocols are standardized in the Internet Engineering Task Force (IETF). The standardization process in this organization involves a large amount of freely accessible textual artifacts: discussions in mailing lists and on GitHub, as well as recorded meeting minutes of online and in-person meetings. The results are “RFCs”: prose documents which lend themselves to NLP analysis just as well as the textual body that precedes their finalization. This talk will introduce the IETF process and the available means to access the text archives, and share some ideas on NLP analyses that might be useful to the Internet community or computer scientists in general.
February 14, 2022
What does BERT learn from current Natural Language Understanding datasets: verbal reasoning skills, or shallow heuristics? This talk discusses the available evidence and presents a case study of generalization in NLI (from MNLI to the adversarially constructed HANS dataset) in a range of BERT-based architectures (adapters, Siamese Transformers, HEX debiasing), as well as with subsampling the data and increasing the model size. Most strategies are unsuccessful, but they all provide insights into how Transformer-based models learn to generalize.
January 31, 2022
Historical ciphers and keys represent a rich source of information that can provide great insights into our past. The main drawback, however, is that such sources are spread out in archives all over the world, which makes it rather difficult to analyze and compare various manuscripts.
The DECRYPT project started out as an effort to address this issue by building a reliable and comprehensive system that aims to make such sources easily available to the general public. This interdisciplinary endeavor brings together historians, linguists, cryptographers and programmers who work together to develop tools that can facilitate the automatic analysis of encoded manuscripts.
January 17, 2022
Targeted Sentiment Analysis (TSA) aims to detect, for each sentence, the words representing what is spoken about positively or negatively. In the sentence "I admire my dog", "my dog" is spoken about positively. TSA does not include finding the holder/source ("I"), or the words that express this positivity or negativity, like "admire".
I have applied TSA to the Norwegian NoReC-fine dataset, comparing different word embeddings used with an LSTM, as well as some pretrained BERT-related models. NoReC-fine consists of newspaper reviews covering various domains. We look at the cross-domain effect on the results and compare them with cross-lingual experiments on same-domain data.
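TSA is commonly cast as sequence labelling with BIO-style tags. The sketch below (a hypothetical tag scheme, not necessarily the exact NoReC-fine one) decodes polarity-bearing target spans from such tags.

```python
def decode_targets(tokens, tags):
    """Collect sentiment targets from BIO tags like 'B-Positive'/'I-Positive'."""
    targets, current, polarity = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):              # a new target span starts
            if current:
                targets.append((" ".join(current), polarity))
            current, polarity = [token], tag[2:]
        elif tag.startswith("I-") and current: # continue the open span
            current.append(token)
        else:                                  # 'O' tag closes any open span
            if current:
                targets.append((" ".join(current), polarity))
            current, polarity = [], None
    if current:
        targets.append((" ".join(current), polarity))
    return targets

tokens = ["I", "admire", "my", "dog"]
tags = ["O", "O", "B-Positive", "I-Positive"]
print(decode_targets(tokens, tags))  # [('my dog', 'Positive')]
```

An LSTM or BERT-based tagger predicts one such tag per token; the decoding step above is the same regardless of which encoder produced the tags.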
December 13, 2021
SemEval-2021 Task 2: Multilingual and Crosslingual Word-in-Context Disambiguation (MCL-WiC) is proposed as a benchmark to evaluate context-sensitive word representations. Our main interest is to investigate the usefulness of pre-trained multilingual language models (LMs) in this MCL-WiC task, without resorting to sense inventories, dictionaries, or other resources. As our main method, we fine-tune the language models with a span classification head. We also experiment with using the multilingual language models as feature extractors, extracting contextual embeddings for the target word. We compare three different LMs: XLM-RoBERTa (XLMR), multilingual BERT (mBERT) and multilingual distilled BERT (mDistilBERT).
We find that fine-tuning is better than feature extraction. XLMR performs better than mBERT in the cross-lingual setting both with fine-tuning and feature extraction, whereas these two models give a similar performance in the multilingual setting. mDistilBERT performs poorly with fine-tuning but gives similar results to the other models when used as a feature extractor.
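As a rough sketch of the feature-extraction variant: a WiC decision can be reduced to comparing the target word's contextual embeddings from the two sentences, e.g. by thresholding their cosine similarity. The vectors and threshold below are illustrative, not taken from the experiments.

```python
import numpy as np

def same_sense(vec_a, vec_b, threshold=0.5):
    """WiC-style decision: do two contextual embeddings of the target word
    reflect the same sense? Here: cosine similarity against a threshold."""
    cos = vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    return cos >= threshold

# hypothetical contextual embeddings of "bank" in two sentences
river_bank = np.array([0.9, 0.1, 0.0])
money_bank = np.array([0.0, 0.2, 0.9])
print(same_sense(river_bank, money_bank))  # False
```

The fine-tuning approach instead feeds both sentences to the model jointly and trains a span classification head end-to-end, which is what performed best in these experiments.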
November 29, 2021
With technological development and decreasing costs, wind power has lately become profitable in Norway even without subsidies, which has led to an increase in new developments across the country. These new developments have not always been welcomed, and as development continued to increase, so did the opposition. Beyond the requirement that the benefits and burdens of energy infrastructure and energy policies be distributed fairly and equitably, transitioning to a low-carbon society or energy system requires support for political decisions and policies.
A traditional way to collect information on public opinion is via questionnaires, surveys or interviews. These methods may, however, be prone to selection bias and response bias, as well as missing data and incomplete information. There is therefore value in exploring alternative methods of acquiring information on public opinion. In this study, we follow the work of Kim et al. and assess public sentiment in Norway towards on- and offshore wind energy via a machine learning approach to natural language processing, based on data scraped from social media sites such as Twitter. We collected about 70,000 Norwegian tweets, manually annotated a subset of them, and used this subset to fine-tune NorBERT. We then used the fine-tuned model to classify the remaining tweets.
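The annotate-a-subset-then-classify-the-rest workflow can be sketched as follows. The word-majority classifier here is a deliberately trivial stand-in for fine-tuned NorBERT, and both tweets are invented.

```python
from collections import Counter, defaultdict

def fit_majority(labelled):
    """Toy stand-in for NorBERT fine-tuning: remember, per word, the label
    it co-occurred with most often in the manually annotated subset."""
    counts = defaultdict(Counter)
    for text, label in labelled:
        for word in text.lower().split():
            counts[word][label] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def classify(model, text, default="neutral"):
    """Label an unannotated tweet by majority vote over its known words."""
    votes = Counter(model[w] for w in text.lower().split() if w in model)
    return votes.most_common(1)[0][0] if votes else default

# two hypothetical annotated tweets (the real annotated subset is far larger)
labelled = [("turbinene ødelegger naturen", "negative"),
            ("flott med grønn energi", "positive")]
model = fit_majority(labelled)
print(classify(model, "turbinene ødelegger alt"))  # negative
```

A fine-tuned transformer replaces the word-level voting with contextual representations, but the overall pipeline (annotate a subset, train, label the rest) is the same.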
November 8, 2021
We present a new approach to text anonymization, one that moves beyond a pure NER task and incorporates disclosure risk into the process, combining NLP and privacy-preserving data publishing (PPDP). Making use of Wikipedia biographies and background knowledge from Wikidata, we propose an automatic annotation method based on k-anonymity that can produce large amounts of labeled data for sensitive information. We train two BERT models on these data, following two different approaches to picking sensitive terms to mask. We also manually annotate and release a sample of 1,000 article summaries, and use it to check the performance of our models.
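The k-anonymity intuition behind the automatic annotation can be sketched like this: a term is considered safe if at least k individuals in the background knowledge share it, and a candidate for masking otherwise. The data below is a toy stand-in, not the actual Wikidata pipeline.

```python
def k_anonymous(term, profiles, k=2):
    """A term is safe to keep if at least k individuals in the background
    knowledge share it; otherwise it narrows the person down too much."""
    holders = sum(1 for attrs in profiles.values() if term in attrs)
    return holders >= k

# toy Wikidata-style background knowledge: person -> set of attributes
profiles = {
    "A": {"writer", "oslo", "1975"},
    "B": {"writer", "bergen", "1980"},
    "C": {"painter", "oslo", "1975"},
}
print(k_anonymous("writer", profiles))  # True: shared by two people
print(k_anonymous("bergen", profiles))  # False: unique to one person
```

Applying this check over biography text yields masking labels automatically, which is what makes large-scale training data generation possible without manual annotation.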
October 25, 2021
October 11, 2021
"Improving Multilingual Lexical Normalization by Fine-tuning ByT5"
David Samuel (LTG)
We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021, which evaluates lexical normalization systems on 12 social media datasets in 11 languages.
Our system is based on a pre-trained byte-level language model, ByT5, which we further pre-train on synthetic data and then fine-tune on authentic normalization data. It achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. We release both the source code and the fine-tuned models.
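One detail that makes ByT5 a good fit for lexical normalization is that it operates directly on UTF-8 bytes, so spelling variation maps to small input edits. A sketch of ByT5-style input encoding, assuming byte values are shifted past the special tokens (pad=0, eos=1, unk=2), which is worth double-checking against the released tokenizer:

```python
def byt5_encode(text):
    """ByT5-style input ids: raw UTF-8 bytes shifted by 3 to leave room
    for the special tokens (pad=0, eos=1, unk=2). No vocabulary needed."""
    return [b + 3 for b in text.encode("utf-8")]

# the noisy form "u r" and its normalization "you are" share most bytes,
# so the model only needs to learn small byte-level rewrites
print(byt5_encode("u r"))  # [120, 35, 117]
```

Because there is no subword vocabulary, out-of-vocabulary social media spellings never fragment unpredictably, which is one plausible reason a byte-level model dominates this task.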
September 27, 2021
Semantics, morphology and syntax are strongly interdependent. However, the majority of computational methods for semantic change detection use distributional word representations which encode mostly semantics. We investigate an alternative method, grammatical profiling, based entirely on changes in the morphosyntactic behaviour of words.
We demonstrate that it can be used for semantic change detection and even outperforms some distributional semantic methods. We present an in-depth qualitative and quantitative analysis of the predictions made by our grammatical profiling system, showing that they are plausible and interpretable.
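One way to picture grammatical profiling (a simplified sketch with invented tag counts, not the paper's exact method): collect a word's morphosyntactic tag distribution in each time period and measure how much the distributions diverge.

```python
from collections import Counter
import math

def profile_distance(tags_a, tags_b):
    """Cosine distance between a word's morphosyntactic tag distributions
    in two time periods; a high distance suggests semantic change."""
    ca, cb = Counter(tags_a), Counter(tags_b)
    keys = set(ca) | set(cb)
    dot = sum(ca[k] * cb[k] for k in keys)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return 1 - dot / (na * nb)

# hypothetical profiles: a noun's number/case tags in two corpora
period1 = ["Sing|Nom"] * 8 + ["Plur|Nom"] * 2
period2 = ["Sing|Nom"] * 2 + ["Plur|Nom"] * 8
print(round(profile_distance(period1, period2), 3))  # 0.529
```

Note that no word embeddings are involved: the signal comes entirely from shifts in morphosyntactic behaviour, which is what makes the predictions easy to inspect and interpret.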