Sentiment Analysis for Norwegian Text
The SANT project aims to create training data and machine-learned models for Sentiment Analysis for Norwegian Text. While coordinated by the Language Technology Group at IFI/UiO, collaborating partners include NRK, Schibsted and Aller Media.
Sentiment Analysis (SA)
One of the applications of Language Technology (LT) that has gained the most widespread use in recent years is so-called opinion mining or sentiment analysis (SA). In broad terms, SA is the task of automatically identifying opinions, attitudes or emotions that are expressed by subjective information in text.
The goal of SANT is to create open resources for sentiment analysis for Norwegian. The project is a collaboration between the Language Technology Group (LTG) at the Department of Informatics at the University of Oslo and three of Norway's largest media groups: the public broadcaster NRK/P3 and the privately held Schibsted Media Group and Aller Media. The media partners provide data in the form of reviews collected across a range of domains: music, literature, restaurants, home electronics, and more. As reviews are by definition packed with subjective opinions and evaluations, they are ideally suited for sentiment analysis.
Below we describe some of the resources created in the project so far.
Document-level SA: Reviews as training data
The Norwegian Review Corpus (NoReC) is a collection of reviews across a wide range of domains. We suggest taking advantage of a peculiarity of the way reviews and critiques are typically summarized in Norwegian arts and consumer journalism, viz. by an explicit rating on a scale of 1–6, represented as a throw of a die. Treating these ratings as labels of overall text polarity, we can train and evaluate machine-learned models for sentiment analysis at the document level. The corpus and further documentation are available here:
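As a minimal sketch of how die ratings can serve as polarity labels, the snippet below maps a rating on the 1–6 scale to a coarse class. The function name `rating_to_label`, the cut-offs (1–2 negative, 3–4 neutral, 5–6 positive) and the toy reviews are assumptions made for illustration; they are not the project's labelling scheme or data.

```python
from collections import Counter


def rating_to_label(rating: int) -> str:
    """Map a die rating (1-6) to a coarse polarity label.

    The cut-offs are an illustrative assumption, not the project's scheme.
    """
    if not 1 <= rating <= 6:
        raise ValueError(f"rating must be in 1..6, got {rating}")
    if rating <= 2:
        return "negative"
    if rating >= 5:
        return "positive"
    return "neutral"


# Toy (text, rating) pairs, invented for illustration only.
reviews = [
    ("A dull and forgettable album.", 2),
    ("Solid, but nothing new.", 4),
    ("A brilliant debut novel.", 6),
]

# Each review becomes a (text, label) training example.
labelled = [(text, rating_to_label(rating)) for text, rating in reviews]
label_counts = Counter(label for _, label in labelled)
```

Keeping the raw 1–6 rating alongside the coarse label would also allow the task to be framed as six-way classification or as regression.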
For some applications, however, it is desirable to have models that can make more granular predictions at the (sub-)sentence level, by identifying the individual polar expressions as well as the targets and holders of the opinions. To enable such models, a subset of the review corpus has been manually annotated with fine-grained and 'structured' in-sentence polarity information, resulting in a dataset dubbed NoReCfine:
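To make the notion of structured, in-sentence polarity information concrete, here is a small sketch of how one opinion could be represented as a record with a polar expression, a polarity, a target and a holder. The class `Opinion`, its field names and the English example sentence are hypothetical and do not reflect the actual annotation format of the dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Opinion:
    """One opinion in a sentence (hypothetical schema for illustration)."""
    polar_expression: str          # the words that carry the sentiment
    polarity: str                  # "positive" or "negative"
    target: Optional[str] = None   # what the opinion is about, if stated
    holder: Optional[str] = None   # who holds the opinion, if stated


def spans_in_sentence(opinion: Opinion, sentence: str) -> bool:
    """Check that every annotated span actually occurs in the sentence."""
    spans = [opinion.polar_expression, opinion.target, opinion.holder]
    return all(span in sentence for span in spans if span is not None)


# Invented example sentence and annotation.
sentence = "The critic loved the new album."
opinion = Opinion(
    polar_expression="loved",
    polarity="positive",
    target="the new album",
    holder="The critic",
)
```

A sentence may of course contain several such records, or none at all.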
We have also created a simplified version of the data set above that allows for training sentence-level polarity classifiers (positive/negative/neutral):
A sentiment lexicon is simply a list of potentially sentiment-bearing words and their prior positive/negative polarity. While such context-independent polarity values will obviously have several shortcomings, the simplicity and transparency of lexicon-based approaches to SA still make them attractive for many applications. The following repo contains a Norwegian sentiment lexicon semi-automatically created on the basis of the English lexicon generated by Hu and Liu (2004):
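A lexicon-based classifier can be sketched in a few lines: count the positive and negative lexicon words in a text and compare. The word sets below are tiny hypothetical stand-ins, not entries taken from the project's lexicon, and the function names are assumptions for illustration.

```python
import re

# Tiny stand-in lexicon; hypothetical entries, not the project's lexicon.
POSITIVE = {"god", "fantastisk", "flott"}
NEGATIVE = {"dårlig", "kjedelig", "svak"}


def lexicon_score(text: str) -> int:
    """Return (#positive words) minus (#negative words) in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return sum((tok in POSITIVE) - (tok in NEGATIVE) for tok in tokens)


def lexicon_polarity(text: str) -> str:
    """Turn the score into a positive/negative/neutral label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Because the polarity values are context-independent, a negated phrase such as "ikke god" ("not good") is still scored as positive by this sketch, which is one of the shortcomings mentioned above.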
The project is funded by the IKTPLUSS initiative of the Research Council of Norway (RCN) until fall 2022.