NLP in the news
This thesis will be carried out within the Machine Learning team at Schibsted Media, where you will work with some of the largest newspapers in Norway and Sweden.
The data are in Norwegian and Swedish, but being a native speaker is not necessary (though it will probably be more fun).
There are several options for the subject of the thesis. All of these proposals are projects that our team has had on its wish list for a while, as fun things to do. If successful, they may end up being put in front of millions of readers:
- (Deep) Content representation for front page recommendations. We at Schibsted are working on personalising the front page of Aftenposten, and how we represent the textual content in the recommender models has a big impact on the end result. Models we want to try out include 1) unsupervised models (e.g. LDA, Doc2Vec) and 2) supervised models trained on a different task (CNN, attention models, triplet learning).
- Named Entity Recognition and domain adaptation in pre-training word embeddings. The idea here is to use the Norwegian/Swedish NER data we have available, and to adapt to the news domain (both in terms of domain-specific data and vocabulary extension) by pre-training embeddings on different datasets.
- Text summarisation, based on pairs of ingress (the lead paragraph) and full text. Use a neural net and seq2seq modelling to generate the ingress of a news article from the body of the article.
- Natural language generation (NLG) for sports (VG/Aftonbladet). From structured event data (goals, free kicks, etc.) to a game summary or live coverage. The data set here would consist of a database of structured data and human-written coverage of matches (e.g. from VG live). The approach could be rule-, grammar- or pattern-based, a neural net (trained end-to-end), or a combination.
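To give a feel for the simplest end of the sports NLG proposal, here is a minimal sketch of a pattern-based generator that turns structured events into live-coverage lines. Everything in it is an assumption made for illustration: the `MatchEvent` fields, the event kinds, the English templates and the player and team names are hypothetical and do not reflect the actual Schibsted data or any existing system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MatchEvent:
    # Hypothetical schema; the real event database may look quite different.
    minute: int
    kind: str      # e.g. "goal" or "free_kick"
    player: str
    team: str


# One hand-written pattern per event kind (illustrative only).
TEMPLATES = {
    "goal": "{minute}': Goal for {team}! {player} scores.",
    "free_kick": "{minute}': Free kick to {team}, taken by {player}.",
}


def render_event(event: MatchEvent) -> Optional[str]:
    """Fill the template for one event; return None for unknown event kinds."""
    template = TEMPLATES.get(event.kind)
    if template is None:
        return None
    return template.format(
        minute=event.minute, team=event.team, player=event.player
    )


def render_match(events: List[MatchEvent]) -> List[str]:
    """Render events in chronological order, skipping kinds without a template."""
    lines = []
    for event in sorted(events, key=lambda e: e.minute):
        line = render_event(event)
        if line is not None:
            lines.append(line)
    return lines


if __name__ == "__main__":
    events = [
        MatchEvent(54, "goal", "Player A", "Team X"),
        MatchEvent(12, "free_kick", "Player B", "Team Y"),
    ]
    for line in render_match(events):
        print(line)
```

A baseline like this is easy to control and evaluate, which makes it a natural reference point for a neural model trained on the human-written coverage, or a starting point for a hybrid of the two.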
The thesis will be supervised by Fredrik Jørgensen (Data Scientist at Schibsted), and optionally co-supervised by another member of the LTG group or Schibsted's ML team.