Neural compound analysis for Norwegian

Compounds are characteristic of most Germanic languages, and they are prevalent and highly productive in Norwegian, e.g. designbutikk 'designer store', høstfarger 'fall colours'. There is, however, no publicly available system for automated compound analysis (which usually involves detecting compounds and then splitting them). The manually annotated Norwegian Dependency Treebank (NDT) marks compounds explicitly, but it does not indicate how they should be split, e.g. design+butikk.
This thesis will explore the use of neural methods for the analysis of Norwegian compounds. Since there is no existing dataset for compound splitting, an important part of the thesis work will be to manually annotate such a dataset and use it in further experimentation.
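
To make the splitting task concrete, the following is a minimal sketch of a dictionary-based baseline, not the neural approach the thesis targets. It assumes a toy vocabulary and a small set of linking elements ("s", "e") between compound parts; the names VOCAB, LINKERS and split_compound, as well as the extra example arbeidsliv, are hypothetical and chosen only for illustration.

```python
from typing import Optional, Set, Tuple

# Toy vocabulary, purely illustrative (an assumption, not a real lexicon).
VOCAB: Set[str] = {"design", "butikk", "høst", "farger", "arbeid", "liv"}

# Linking elements assumed to be allowed between the two parts;
# the empty string means the parts are joined directly.
LINKERS = ("", "s", "e")


def split_compound(
    word: str, vocab: Set[str] = VOCAB, min_len: int = 2
) -> Optional[Tuple[str, str]]:
    """Return (modifier, head) for the leftmost binary split found, else None."""
    word = word.lower()
    # Try every split point that leaves at least min_len characters per side.
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if right not in vocab:
            continue
        for linker in LINKERS:
            if not left.endswith(linker):
                continue
            # Strip the linking element (if any) before the vocabulary lookup.
            stem = left[: len(left) - len(linker)] if linker else left
            if stem in vocab:
                return stem, right
    return None


if __name__ == "__main__":
    for w in ("designbutikk", "høstfarger", "arbeidsliv", "butikk"):
        print(w, "->", split_compound(w))
```

With this toy vocabulary, designbutikk is split into design and butikk, while the non-compound butikk yields no split. A lookup of this kind cannot handle unseen parts or ambiguous split points, which is part of the motivation for an annotated dataset and learned models.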

The project is well suited for two students, but it can also be carried out individually with a somewhat narrower scope, for instance by focusing only on compound detection (using NDT) or by experimenting with unsupervised techniques for compound splitting.

The project requires a balance of technical and linguistic expertise. Good programming skills, experience with machine learning and a solid background in NLP are relevant qualifications. Please contact the supervisors to discuss further details.

Keywords: NLP, machine learning, language technology
Published Oct. 7, 2019 22:39 - Last modified Oct. 7, 2019 22:39

Scope (credits)