Keyphrase extraction from Norwegian reviews
Online reviews are a gold mine for users who want an overview of their options, whether they are choosing which product to purchase or deciding if a film is worth watching. Reviews often shape users' expectations and can uncover aspects of a product that the user had not thought of. However, the sheer number of reviews available online makes it difficult to digest them all, and the length of a single review can discourage a user from reading it. Which review should the user read? Which aspects of the product are being reviewed? Several automated approaches that aim to reduce the effort of reading reviews have been proposed in the literature. One applied example is provided by Booking.com: the user is presented not only with the overall scores given by guests of a given hotel, but also with a list of what "guests love" about the hotel (see here for an example).
This project seeks to implement similar functionality for Norwegian. Concretely, the goal is to automatically extract the most salient keyphrases related to the reviewed item (product, film, show, etc.) from a given review. The keyphrases can be seen as a short summary or condensed overview of the content of each review. If time permits, the project will also seek to classify the extracted keyphrases by sentiment, i.e. positive/negative polarity (see Wikipedia for a brief introduction to sentiment analysis). The data used for this project will be the newly released Norwegian Review Corpus (NoReC), comprising over 35,000 reviews from different domains with ratings on a scale of 1–6.
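To make the task concrete, the following is a minimal sketch of one simple unsupervised baseline: ranking candidate n-grams by TF-IDF. It is an illustration under stated assumptions, not the method the project prescribes. The function names, the tiny stopword list, and the toy reviews are all hypothetical, and the code does not use the actual NoReC data or format.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {
    "og", "i", "er", "det", "en", "et", "på", "som", "med", "men",
    "var", "har", "til", "av", "for", "den", "de", "ikke", "veldig",
}


def candidates(text, max_n=2):
    """Return candidate phrases: n-grams (n <= max_n) that neither start
    nor end with a stopword and do not cross punctuation boundaries."""
    phrases = []
    # Split on punctuation so that n-grams stay within one segment.
    for segment in re.split(r"[.!?,;:()\n]+", text.lower()):
        tokens = re.findall(r"\w+", segment)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                phrases.append(" ".join(gram))
    return phrases


def extract_keyphrases(review, corpus, top_k=3, max_n=2):
    """Rank the candidate phrases of `review` by TF-IDF, where document
    frequencies are computed over the background `corpus`."""
    doc_sets = [set(candidates(doc, max_n)) for doc in corpus]
    n_docs = len(doc_sets)
    tf = Counter(candidates(review, max_n))
    scores = {}
    for phrase, count in tf.items():
        df = sum(1 for doc in doc_sets if phrase in doc)
        # Smoothed IDF, so phrases unseen in the corpus get a finite weight.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[phrase] = count * idf
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, _ in ranked[:top_k]]


if __name__ == "__main__":
    # Toy background reviews (invented for illustration).
    corpus = [
        "Filmen var fantastisk.",
        "Rommet var lite og kaldt.",
        "Boka er lang, men spennende.",
    ]
    review = "Fantastisk frokost og hyggelig personale. Fantastisk frokost hver dag."
    print(extract_keyphrases(review, corpus))
```

A baseline of this kind ignores syntax and treats every n-gram as a candidate. A more realistic system would likely restrict candidates to noun phrases and compare against the approaches covered in the survey cited below.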
For a survey of current keyphrase extraction techniques, see Hasan and Ng (2014).