Talk of Norway
In this ongoing cross-disciplinary collaboration, researchers in Language Technology (LT) and Political Science (PS) apply supervised and unsupervised machine learning methods to data from the Norwegian parliament in order to gather knowledge along several dimensions.
The Talk of Norway (ToN) data set is a collection of 250,373 speeches from the Norwegian Parliament from 1998 to 2016. The speeches come with a rich set of 83 metadata variables and are annotated with sentence and token boundaries, lemmas, parts of speech and morphological features.
The raw speeches and metadata were pulled from the Storting API, with additional information scraped from the Storting homepage and integrated with data from other sources.
Automatic language identification and morphological analysis of the speeches were obtained with langid.py and the Oslo-Bergen Tagger (OBT), as implemented in the CLARINO Language Analysis Portal.
Version 1.0 of the data set is available in the talk-of-norway GitHub repository.
In terms of classical quantitative PS, we are interested in measuring the political coherence of speeches within each party, observing how this differs across and within the parties themselves, and determining whether there are noticeable changes in significant periods, for instance during electoral campaigns or when a given party supports or opposes the government. In terms of NLP research, we are interested in exploring different system development and feature extraction strategies in order to deliver the best possible analyses in this context. We are also interested in experimenting with linguistic features to evaluate the contribution of hidden layers of information to this task. Finally, in the context of the digital humanities-focused CLARINO initiative, this project serves as a use case and platform for further development and testing of the Language Analysis Portal, by taking advantage of its stack of language analysis tools, its High Performance Computing (HPC) facilities and its flexible data model.
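As an illustration only, one simple way to put a number on the coherence of a party's speeches is to average the cosine similarity between each speech and the party's aggregate word-count vector. The sketch below is an assumption made for the sake of example, not the method used in the project: the function names, the whitespace tokenization (a stand-in for the lemmas in the data set) and the toy speeches are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercased, whitespace-tokenized term counts (a stand-in for real lemmas)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def party_coherence(speeches):
    """Map each party to the mean similarity between its speeches and its centroid.

    `speeches` is an iterable of (party, text) pairs. Each speech is part of
    its own party centroid, so scores are biased upwards for small parties.
    """
    vectors = defaultdict(list)
    for party, text in speeches:
        vectors[party].append(bag_of_words(text))

    scores = {}
    for party, vecs in vectors.items():
        centroid = Counter()
        for vec in vecs:
            centroid.update(vec)  # sum of the party's count vectors
        scores[party] = sum(cosine(vec, centroid) for vec in vecs) / len(vecs)
    return scores


# Made-up toy speeches, not taken from the ToN data.
toy_speeches = [
    ("A", "lower taxes for business"),
    ("A", "lower taxes for families"),
    ("B", "protect the fjords"),
    ("B", "raise the minimum wage"),
]
scores = party_coherence(toy_speeches)
for party, score in sorted(scores.items()):
    print(party, round(score, 3))
```

On the toy data, party A (two speeches sharing most of their vocabulary) scores higher than party B (two speeches on unrelated topics). Grouping by party and time period instead of party alone would give a rough view of how such a score changes around elections or changes of government.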