Dataset for sentence-level polarity

We have just released a new dataset for modeling sentence-level polarity for Norwegian: NoReC_sentence.

The previously released NoReC_fine dataset annotates fine-grained sentiment information for Norwegian, including target expressions, polar expressions, intensity and holders. For some applications, however, it may be more convenient to predict sentence-level polarity instead. Although it is a greatly simplified task, sentence-level polarity prediction is widely used within NLP, in particular for quick model benchmarking, for example on well-known English datasets like SST.

In the newly released dataset NoReC_sentence, we have aggregated the fine-grained annotations to the sentence level in two ways:

Binary: includes the subset of sentences that contained sentiment annotations of either positive or negative polarity (but not both).
Three-way: additionally includes sentences annotated as having no sentiment at all (neutral).

Note that sentences containing mixed polarity (both positive and negative annotations) are excluded from both the binary and the three-way version.
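
The aggregation rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the script used to build the dataset: the function name aggregate_polarity and the label strings "positive", "negative" and "neutral" are assumptions made for the example, and the input is assumed to be the list of polarities found in one sentence's fine-grained annotations.

```python
from typing import Iterable, Optional


def aggregate_polarity(polarities: Iterable[str], three_way: bool = False) -> Optional[str]:
    """Collapse the polarities annotated in one sentence into a single label.

    Returns None when the sentence is excluded from the chosen variant.
    The label strings are hypothetical, chosen for this sketch only.
    """
    distinct = set(polarities)
    if distinct == {"positive"}:
        return "positive"
    if distinct == {"negative"}:
        return "negative"
    if not distinct and three_way:
        # No sentiment annotations at all: kept only in the three-way variant.
        return "neutral"
    # Mixed polarity (or any unexpected value) is excluded in both variants,
    # and sentences without sentiment are excluded in the binary variant.
    return None


# Example: one sentence per list of annotated polarities.
examples = [["positive", "positive"], ["negative"], [], ["positive", "negative"]]
binary_labels = [aggregate_polarity(p) for p in examples]
three_way_labels = [aggregate_polarity(p, three_way=True) for p in examples]
```

In this sketch a sentence with several annotations of the same polarity still receives that single label, while a sentence with no annotations is dropped in the binary variant and labeled neutral in the three-way variant.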

The dataset has already been used for benchmarking large-scale contextualized language models for Norwegian like NorBERT and NoTraM, as documented in forthcoming NoDaLiDa publications.


Tags: sentiment analysis, NLP, NLP benchmarks, language technology

Published Apr. 8, 2021 4:15 PM - Last modified Apr. 8, 2021 4:17 PM