New corpus release: OpenSubtitles 2016

Together with Jörg Tiedemann (University of Helsinki), the Language Technology Group has just released a major update of "OpenSubtitles", a collection of parallel corpora extracted from movie and TV subtitles.

The dataset contains 2.8 million subtitle files in 60 languages, totaling over 17 billion tokens in 2.6 billion sentences. As far as we know, this makes it the world's largest multilingual corpus. See our paper at LREC for more details on the corpus.

By Pierre Lison
Published Mar. 16, 2016 12:16 AM - Last modified Mar. 16, 2016 12:16 AM