Graph representations of semantic vectors
Graph representations of word embedding models: evaluating existing approaches and possible applications in language data processing and visualization.
Distributional semantic vector models (or word embeddings) are trained on large text corpora and aim to assign semantically meaningful dense vectors to words. Conceptually, a word embedding model with vector size n is a matrix of n-dimensional vectors, where the number of rows is equal to the number of words in the corpus lexicon. Semantic similarity between words is measured as the cosine similarity between the corresponding vectors. For many applications, this matrix or parts of it can be represented as a graph – see e.g. Gyllensten & Sahlgren (2015). Such graphs can be useful for word sense induction, word clustering, discovering words central to semantic networks, and last but not least – for creating attractive and insightful visualizations of semantic relations between words (example from the Dilipad project).
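To make the matrix view concrete, here is a minimal sketch in Python with NumPy. The vocabulary and the vector values are toy examples invented for illustration, not taken from any real embedding model:

```python
import numpy as np

# Toy embedding matrix: one n-dimensional row vector per word in the
# lexicon (both the words and the numbers are illustrative).
vocab = ["cat", "dog", "car"]
E = np.array([
    [0.9, 0.1, 0.0],   # cat
    [0.8, 0.2, 0.1],   # dog
    [0.0, 0.1, 0.9],   # car
])

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically close words should receive a higher similarity score.
print(cosine_similarity(E[0], E[1]))  # cat vs. dog
print(cosine_similarity(E[0], E[2]))  # cat vs. car
```

In a real model the matrix would have hundreds of dimensions and up to millions of rows, but the similarity computation is exactly the same.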
However, there is more than one way of transforming distributional data from a matrix representation into a graph representation. The most obvious approach is to create a fully connected graph with words as nodes and edge weights equal to the cosine similarity between word pairs. However, for large corpora (millions of word types in the lexicon) this can be prohibitively costly in terms of memory and processing time. Thus, there are ways to approximate these semantic similarities without making the graph fully connected (for example, drawing an edge only if the cosine similarity between two words is higher than a pre-defined threshold). The aim of this project is to evaluate these approaches in a variety of downstream tasks (primarily on English material) and to explore how well they work for visualizations. Basic knowledge of graph theory is recommended; however, in case of interest a quick crash course can be given.