Web-Scale Natural Language Processing in Northern Europe
Research centers from across Northern Europe, the US-based Common Crawl Foundation, and the Nordic e-Infrastructure Collaboration (NeIC) work together to enable very large-scale computational experimentation in natural language processing. This one-day workshop serves to kick off the collaboration, discuss community needs, and—through an open, popular-science session towards the end of the day—generate broader visibility.
Background and Motivation
Natural Language Processing (NLP) is the inter-disciplinary branch of (among others) Computer Science and Linguistics that enables everyday technologies and services such as automated translation (e.g. Google Translate), human–machine interaction in spoken language (e.g. Apple's Siri), or content recommendation and contextual advertisement (e.g. on on-line news sites). As such, language-enabled technologies are of rapidly growing societal and commercial relevance. In the past decade, NLP research has grown increasingly compute-intensive and data-driven, to the point that some (very large-scale) methodologies are currently a ‘privilege’ of researchers who work in only the largest multi-national corporations.
In principle, NLP researchers in Northern Europe have access to modern and comparatively extensive public e-infrastructures, e.g. high-performance computing (HPC) and storage facilities. However, there is no tradition of HPC utilization for NLP research yet, and candidate user communities have only recently begun to gain access to and learn to use national computing facilities. Thus, this one-day meeting brings together researchers in natural language processing from across Northern Europe and other relevant stakeholders, to (a) facilitate ‘community building’ across national borders; (b) exchange experience reports and research visions; (c) kick off a collaboration with the Common Crawl Foundation; and (d) identify trans-national e-infrastructure community needs, as input to the Nordic e-Infrastructure Collaboration (NeIC).
Both the seminar and the workshop are held in seminar room Smalltalk, on the ground level (aka 1st floor) of the Department of Informatics (Ole Johan Dahls hus), right by the main entrance.
Common Crawl, Open Web Data, and Nordic HPC—An Equal-Opportunity Research Infrastructure
|17:00||17:30||Lisa Green & Stephen Merity (Common Crawl)||The Common Crawl: Web Data for Everyone|
|17:30||17:50||Jörg Tiedemann (Uppsala)||Scalability in Statistical Machine Translation Research|
|17:50||18:10||Gudmund Høst (NeIC)||E-Infrastructure Opportunities for Northern Europe|
|18:10||18:30||Ole Widar Saastad (Oslo)||ABEL and NorStore: National HPC Facilities Hosted by UiO|
|18:30||19:30||Informal Gathering||Snacks & Guided Tour of HPC Facilities|
To sign up for the public seminar at the end of the day, please register on-line, using the link on the right-hand side (registration will inform catering, among other things). For participation in the workshop, please just contact Stephan Oepen informally; in case you would like to present on your own research, there is still room in the programme.
Web-Scale Natural Language Processing in Northern Europe
|11:00||11:10||Stephan Oepen (Oslo)||Welcome|
|11:10||11:40||Stephen Merity (Common Crawl)||The Common Crawl for NLP Research|
|11:40||12:00||Filip Ginter (Turku)||Seeding the Finnish Internet ParseBank with CommonCrawl: An Experience Report|
|12:00||12:20||Lars Bungum (Trondheim)||Clustering the SdeWaC Corpus Using a Parallel SOM Implementation|
|12:20||12:40||Joakim Nivre (Uppsala)||Big is Beautiful or Less is More? Reflections on Resource-Intensive NLP|
|12:40||13:00||Discussion||HPC Experience for NLP|
|14:00||14:20||Barbara Plank (Copenhagen)||Benefits of HPC for NLP besides Big Data|
|14:20||14:40||Tommi Jauhiainen (Helsinki)||Finno-Ugric Languages and the Internet|
|14:40||15:00||Krister Lindén (Helsinki)||Experiences from Parsing Finnish and Swedish Billion-Word Corpora|
|15:00||15:20||Stephan Oepen (Oslo)||Towards a Nordic Language Processing Grid|
|15:20||15:40||Dejan Vitlacil (NeIC)||NeIC and Generic Area Coordination|
|15:40||16:30||Discussion||E-Infrastructure Community Needs|
With financial support from NeIC and the Norwegian WeSearch project, we can make available up to five travel stipends to (primarily earlier-stage) researchers who might otherwise be unable to participate. Please contact Stephan Oepen for details about stipend availability and allocation.
Both Lisa Green and Stephen Merity will stay in Oslo for the day following the workshop, i.e. Tuesday, November 25, 2014. The plan is to use the extra time to discuss technical details of the Common Crawl data, possible use patterns of the data for NLP research, candidate synchronization of web harvesting for the languages of Northern Europe, and the creation of a ‘mirror’ of (relevant parts of) the Common Crawl on NorStore facilities. All workshop participants are welcome to stay overnight and use Tuesday for sub-group coordination and hands-on activities. Stephan Oepen will be happy to assist in making accommodation arrangements.
- Lars Bungum, Norwegian University of Science and Technology (Norway)
- Arjun Chandra, University of Oslo (Norway)
- Johan Benum Evensberget, Computas (Norway)
- Tommi Jauhiainen, University of Helsinki (Finland)
- Gudmund Høst, NeIC (Northern Europe)
- Ciprian Gerstenberger, Arctic University of Norway (Norway)
- Filip Ginter, University of Turku (Finland)
- Kjetil Kjernsmo, University of Oslo (Norway)
- Milen Kouylekov, University of Oslo (Norway)
- Michał Kosek, University of Oslo (Norway)
- Elisabeth Lien, University of Oslo (Norway)
- Krister Lindén, University of Helsinki (Finland)
- Pierre Lison, University of Oslo (Norway)
- Juhani Luotolahti, University of Turku (Finland)
- Jan Tore Lønning, University of Oslo (Norway)
- Joakim Nivre, Uppsala University (Sweden)
- Anders Nøklestad, University of Oslo (Norway)
- Stephan Oepen, University of Oslo (Norway)
- Barbara Plank, University of Copenhagen (Denmark)
- Thomas Röblitz, NeIC (Northern Europe)
- Arne Skjærholt, University of Oslo (Norway)
- Diana Santos, University of Oslo (Norway)
- Koenraad de Smedt, University of Bergen (Norway)
- Jörg Tiedemann, Uppsala University (Sweden)
- Amund Tveit, Memkite (Norway)
- Erik Velldal, University of Oslo (Norway)
- Dejan Vitlacil, NeIC (Northern Europe)