Web-Scale Natural Language Processing in Northern Europe

Research centers from across Northern Europe, the US-based Common Crawl Foundation, and the Nordic e-Infrastructure Collaboration (NeIC) work together to enable very large-scale computational experimentation in natural language processing.  This one-day workshop serves to kick off the collaboration, discuss community needs, and—through an open, popular-science session towards the end of the day—generate broader visibility.

Background and Motivation

Natural Language Processing (NLP) is the inter-disciplinary branch of (among others) Computer Science and Linguistics that enables everyday technologies and services such as automated translation (e.g. Google Translate), human–machine interaction in spoken language (e.g. Apple's Siri), or content recommendation and contextual advertisement (e.g. on on-line news sites).  As such, language-enabled technologies are of rapidly growing societal and commercial relevance.  In the past decade, NLP research has grown increasingly compute-intensive and data-driven, to the point where some (very large-scale) methodologies have become a ‘privilege’ of researchers who work in only the largest multi-national corporations.

In principle, NLP researchers in Northern Europe have access to modern and comparatively extensive public e-infrastructures, e.g. high-performance computing (HPC) and storage facilities.  However, there is no tradition of HPC utilization for NLP research yet, and candidate user communities have only recently begun to gain access to and learn to use national computing facilities.  Thus, this one-day meeting brings together researchers in natural language processing from across Northern Europe and other relevant stakeholders, to (a) facilitate ‘community building’ across national borders; (b) exchange experience reports and research visions; (c) kick off a collaboration with the Common Crawl Foundation; and (d) identify trans-national e-infrastructure community needs, as input to the Nordic e-Infrastructure Collaboration (NeIC).

Public Seminar:
Common Crawl, Open Web Data, and Nordic HPC—An Equal-Opportunity Research Infrastructure

17:00 17:30 Lisa Green & Stephen Merity (Common Crawl) The Common Crawl: Web Data for Everyone
17:30 17:50 Jörg Tiedemann (Uppsala) Scalability in Statistical Machine Translation Research
17:50 18:10 Gudmund Høst (NeIC) E-Infrastructure Opportunities for Northern Europe
18:10 18:30 Ole Widar Saastad (Oslo) ABEL and NorStore: National HPC Facilities Hosted by UiO
18:30 19:30 Informal Gathering Snacks & Guided Tour of HPC Facilities


(Internal) Workshop:
Web-Scale Natural Language Processing in Northern Europe

11:00 11:10 Stephan Oepen (Oslo) Welcome
11:10 11:40 Stephen Merity (Common Crawl) The Common Crawl for NLP Research
11:40 12:00 Filip Ginter (Turku) Seeding the Finnish Internet ParseBank with CommonCrawl: An Experience Report
12:00 12:20 Lars Bungum (Trondheim) Clustering the SdeWaC Corpus Using a Parallel SOM Implementation
12:20 12:40 Joakim Nivre (Uppsala) Big is Beautiful or Less is More? Reflections on Resource-Intensive NLP
12:40 13:00 Discussion HPC Experience for NLP
13:00 14:00 Lunch (On-Site)
14:00 14:20 Barbara Plank (Copenhagen) Benefits of HPC for NLP besides Big Data
14:20 14:40 Tommi Jauhiainen (Helsinki) Finno-Ugric Languages and the Internet
14:40 15:00 Krister Lindén (Helsinki) Experiences from Parsing Finnish and Swedish Billion-Word Corpora
15:00 15:20 Stephan Oepen (Oslo) Towards a Nordic Language Processing Grid
15:20 15:40 Dejan Vitlacil (NeIC) NeIC and Generic Area Coordination
15:40 16:30 Discussion E-Infrastructure Community Needs

Travel Support

With financial support from NeIC and the Norwegian WeSearch project, we can make available up to five travel stipends to (primarily earlier-stage) researchers who might otherwise be unable to participate.  Please contact Stephan Oepen for details about stipend availability and allocation.

Hands-On Follow-Up

Both Lisa Green and Stephen Merity will stay in Oslo for the day following the workshop, i.e. Tuesday, November 25, 2014.  The plan is to use the extra time to discuss technical details of the Common Crawl data, possible use patterns of the data for NLP research, candidate synchronization of web harvesting for the languages of Northern Europe, and the creation of a ‘mirror’ of (relevant parts of) the Common Crawl on NorStore facilities.  All workshop participants are welcome to stay overnight and use Tuesday for sub-group coordination and hands-on activities.  Stephan Oepen will be happy to assist in making accommodation arrangements.

Expected Participants

  • Lars Bungum, Norwegian University of Science and Technology (Norway)
  • Arjun Chandra, University of Oslo (Norway)
  • Johan Benum Evensberget, Computas (Norway)
  • Tommi Jauhiainen, University of Helsinki (Finland)
  • Gudmund Høst, NeIC (Northern Europe)
  • Ciprian Gerstenberger, Arctic University of Norway (Norway)
  • Filip Ginter, University of Turku (Finland)
  • Kjetil Kjernsmo, University of Oslo (Norway)
  • Milen Kouylekov, University of Oslo (Norway)
  • Michał Kosek, University of Oslo (Norway)
  • Elisabeth Lien, University of Oslo (Norway)
  • Krister Lindén, University of Helsinki (Finland)
  • Pierre Lison, University of Oslo (Norway)
  • Juhani Luotolahti, University of Turku (Finland)
  • Jan Tore Lønning, University of Oslo (Norway)
  • Joakim Nivre, Uppsala University (Sweden)
  • Anders Nøklestad, University of Oslo (Norway)
  • Stephan Oepen, University of Oslo (Norway)
  • Barbara Plank, University of Copenhagen (Denmark)
  • Thomas Röblitz, NeIC (Northern Europe)
  • Arne Skjærholt, University of Oslo (Norway)
  • Diana Santos, University of Oslo (Norway)
  • Koenraad de Smedt, University of Bergen (Norway)
  • Jörg Tiedemann, Uppsala University (Sweden)
  • Amund Tveit, Memkite (Norway)
  • Erik Velldal, University of Oslo (Norway)
  • Dejan Vitlacil, NeIC (Northern Europe)


Stephan Oepen
Published Oct. 23, 2014 1:43 PM - Last modified Oct. 14, 2016 2:21 PM