Semantic word clustering for statistical parsing
Clustering is an unsupervised machine learning technique for automatically forming groups of similar observations in a data set. In natural language processing, clustering is often applied to large text corpora to form semantic classes of words. Recently there has been increased interest in using such word clusters as an additional source of information to improve the performance of statistical parsers.
Another area that has seen much interest recently is the use of so-called word embeddings; distributional models trained on very large data sets to efficiently model the similarities of words (see for example Goggle's word2vec toolkit).
In this project the idea is to combine several of these trends, by first performing clustering on word embedding representations and then use this to improve the performance of statistical dependency parsers. We will aim to perform experiments on both Norwegian and English data sets, and parsing will be performed using the MaltParser toolkit.