Sentiment Analysis for Norwegian Text

The SANT project aims to create training data and machine-learned models for Sentiment Analysis for Norwegian Text. While coordinated by the Language Technology Group at IFI/UiO, collaborating partners include NRK, Schibsted and Aller Media.

Sentiment Analysis (SA)

One of the applications of Language Technology (LT) that has gained most widespread use in recent years is so-called opinion mining or sentiment analysis (SA). In broad terms, SA is the task of automatically identifying opinions, attitudes or emotions that are expressed by subjective information in text.

SANT

The goal of SANT is to create open resources for sentiment analysis for Norwegian. The project is a collaboration between the Language Technology Group (LTG) at the Department of Informatics at the University of Oslo and three of Norway's largest media groups: the public broadcaster NRK/P3 and the privately held Schibsted Media Group and Aller Media. The media partners provide data in the form of reviews, collected across a range of different domains: music, literature, restaurants, home electronics, and more. As reviews by definition are packed with subjective opinions and evaluations, they are ideally suited for sentiment analysis.

Below we describe some of the resources created in the project so far.

Document-level SA: Reviews as training data

The Norwegian Review Corpus (NoReC) is a collection of reviews across a wide range of domains. Here we take advantage of a peculiarity of the way reviews and critiques are typically summarized in Norwegian arts and consumer journalism, namely by an explicit rating on a scale of 1–6, represented as the throw of a die. Treating these ratings as labels of overall text polarity, we can train and evaluate machine-learned models for sentiment analysis at the document level. The corpus and further documentation are available here:
https://github.com/ltgoslo/norec 
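
As a rough illustration of how die ratings can serve as polarity labels, the following Python sketch maps a rating to a coarse class. The cut-offs (1–2 negative, 5–6 positive, 3–4 left unlabeled), the function name and the example reviews are assumptions made for this sketch, not the scheme or data used in the project.

```python
from typing import Optional


def rating_to_polarity(rating: int) -> Optional[str]:
    """Map a die rating (1-6) to a coarse polarity label.

    The cut-offs are an illustrative assumption, not the project's scheme.
    """
    if not 1 <= rating <= 6:
        raise ValueError(f"rating must be in 1..6, got {rating}")
    if rating <= 2:
        return "negative"
    if rating >= 5:
        return "positive"
    return None  # middle ratings (3-4) are left unlabeled in this sketch


# Hypothetical (text, rating) pairs, invented for illustration only.
reviews = [
    ("Fantastisk album.", 6),   # "Fantastic album."
    ("Helt grei middag.", 4),   # "Perfectly okay dinner."
    ("Skuffende bok.", 2),      # "Disappointing book."
]

# Attach labels, then keep only the clearly polar documents for training.
labeled = [(text, rating_to_polarity(rating)) for text, rating in reviews]
train = [(text, label) for text, label in labeled if label is not None]
```

Dropping the middle ratings gives a cleaner binary task at the cost of discarding data; keeping all six ratings as classes is an equally reasonable choice.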

Fine-grained SA

For some applications, however, it is desirable to have models that can make more granular predictions at the (sub-)sentence level, by identifying the individual polar expressions as well as the targets and holders of the opinions. To enable such models, a subset of the review corpus has been manually annotated with fine-grained and "structured" in-sentence polarity information, resulting in a dataset dubbed NoReC_fine:
https://github.com/ltgoslo/norec_fine
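
To make the notion of structured polarity information concrete, here is a minimal Python sketch of one opinion as a record with a polar expression, a polarity, a target and a holder. The class, its field names and the example sentence are hypothetical and chosen for illustration; they do not reflect the actual file format of the dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Opinion:
    """One structured opinion within a sentence (hypothetical schema)."""
    polar_expression: str
    polarity: str                  # "positive" or "negative"
    target: Optional[str] = None   # what the opinion is about
    holder: Optional[str] = None   # who expresses the opinion


# Invented example: "I think the screen is fantastic."
sentence = "Jeg synes skjermen er fantastisk."
opinion = Opinion(
    polar_expression="fantastisk",
    polarity="positive",
    target="skjermen",
    holder="Jeg",
)


def is_grounded(op: Opinion, sent: str) -> bool:
    """Check that every annotated element occurs verbatim in the sentence."""
    parts = [op.polar_expression, op.target, op.holder]
    return all(part in sent for part in parts if part is not None)
```

Target and holder are optional in this sketch because a sentence may express an opinion without naming either explicitly.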

We have also created a simplified version of the data set above that allows for training sentence-level polarity classifiers (positive/negative/neutral):
https://github.com/ltgoslo/norec_sentence

Sentiment lexicon

A sentiment lexicon is simply a list of potentially sentiment-bearing words and their prior positive/negative polarity. While such context-independent polarity values will obviously have several shortcomings, the simplicity and transparency of lexicon-based approaches to SA still make them attractive for many applications. The following repo contains a Norwegian sentiment lexicon semi-automatically created on the basis of the English lexicon generated by Hu and Liu (2004):
https://github.com/ltgoslo/norsentlex
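
A lexicon-based approach can be sketched in a few lines of Python: count the positive and negative lexicon words in a text and compare. The tiny word sets below are invented for illustration and are not taken from the lexicon in the repo; the function names and the simple regex tokenizer are likewise assumptions.

```python
import re

# Toy lexicon, invented for this sketch (not the contents of the repo).
POSITIVE = {"god", "bra", "fantastisk"}        # good, good, fantastic
NEGATIVE = {"dårlig", "kjedelig", "skuffende"}  # bad, boring, disappointing


def lexicon_score(text: str) -> int:
    """Return (#positive words) - (#negative words) in the text."""
    tokens = re.findall(r"\w+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return pos - neg


def lexicon_polarity(text: str) -> str:
    """Turn the score into a positive/negative/neutral label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The shortcoming of context-independent values shows up immediately: `lexicon_polarity("Maten var ikke bra")` ("The food was not good") comes out as positive, because the negation "ikke" is ignored.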

Financing

The project has been granted funding from the IKTPLUSS initiative of the Research Council of Norway (RCN) until fall 2022.
 

 

Tags: Sentiment Analysis, Language Technology, Natural Language Processing, Machine Learning, NLP, AI, Artificial intelligence, deep learning, data science
Published June 12, 2017 11:12 PM - Last modified May 12, 2023 3:57 PM
