Cross-domain effects in sentiment analysis
Sentiment analysis (SA) is the task of detecting positive and negative opinions expressed in text, and it is one of the most widely used applications of Natural Language Processing (NLP). SA can be carried out at many levels of granularity, e.g., the level of documents, sentences, or individual expressions. The SANT project has created resources for all these levels of analysis for Norwegian texts, based on arts and consumer reviews gathered from online news sources.
One important challenge in SA, shared with many other areas of NLP, is that of domain effects: models trained to perform well on data from one particular domain or genre will often see a drop in performance when applied to texts from another domain. The Norwegian Review Corpus (NoReC) contains more than 43,000 full-text reviews from a range of different domains, including literature, movies, video games, restaurants, music, and products. The original ratings assigned by the professional reviewers, on a scale of 1–6 (as represented by the dots of a die), can be used as labels for training supervised classifiers to predict document-level sentiment.
The focus of this project will be to investigate cross-domain effects in document-level sentiment analysis on NoReC (and potentially on comparable datasets for English or other languages). The project will also explore different ways of efficiently applying large-scale pre-trained transformer models like NorBERT to document classification. Other research questions include: How well do models trained on one particular domain transfer to other domains? How do these effects change when we control for the amount of training data? Are some domains inherently more difficult than others? Can we quantify domain similarity in a way that correlates with transfer performance? Are some model architectures more robust to cross-domain effects than others?
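To make the question about quantifying domain similarity more concrete, here is a minimal sketch of one possible measure: the Jensen-Shannon divergence between the unigram distributions of two domains. This measure is an assumption chosen for illustration, not a method prescribed by the project. The function names and the toy documents are hypothetical, and the whitespace tokenization is deliberately naive.

```python
import math
from collections import Counter


def unigram_distribution(documents):
    """Relative frequency of each lowercased whitespace token across documents."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """JSD in bits between two token distributions.

    With base-2 logarithms the value lies in [0, 1]:
    0 for identical distributions, 1 for distributions with no shared tokens.
    """
    jsd = 0.0
    for token in set(p) | set(q):
        p_t, q_t = p.get(token, 0.0), q.get(token, 0.0)
        m_t = 0.5 * (p_t + q_t)  # mixture distribution
        if p_t > 0:
            jsd += 0.5 * p_t * math.log2(p_t / m_t)
        if q_t > 0:
            jsd += 0.5 * q_t * math.log2(q_t / m_t)
    return jsd


# Toy, made-up documents standing in for two review domains (not NoReC data).
movies = ["a gripping film with a strong cast", "the plot of the film drags"]
restaurants = ["the food was cold", "a strong menu and friendly service"]

p = unigram_distribution(movies)
q = unigram_distribution(restaurants)
print(f"JSD(movies, restaurants) = {jensen_shannon_divergence(p, q):.3f}")
```

In an actual experiment, such a score could be computed for every pair of domains and compared against the drop in accuracy when a classifier trained on one domain is tested on the other, for example with a rank correlation. Whether a vocabulary-based measure like this tracks transfer performance at all is exactly the kind of question the project leaves open.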
Good programming skills, experience with machine learning, and a solid background in NLP are relevant qualifications.