Publications
-
Lison, Pierre; Barnes, Jeremy; Hubin, Aliaksandr & Touileb, Samia (2020). Named Entity Recognition without Labelled Data: A Weak Supervision Approach, In Dan Jurafsky; Joyce Chai; Natalie Schluter & Joel Tetreault (ed.),
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Association for Computational Linguistics.
ISBN 978-1-952148-25-5.
139.
pp. 1518-1533
Named Entity Recognition (NER) performance often degrades rapidly when applied to target domains that differ from the texts observed during training. When in-domain labelled data is available, transfer learning techniques can be used to adapt existing NER models to the target domain. But what should one do when there is no hand-labelled data for the target domain? This paper presents a simple but powerful approach to learn NER models in the absence of labelled data through weak supervision. The approach relies on a broad spectrum of labelling functions to automatically annotate texts from the target domain. These annotations are then merged together using a hidden Markov model which captures the varying accuracies and confusions of the labelling functions. A sequence labelling model can finally be trained on the basis of this unified annotation. We evaluate the approach on two English datasets (CoNLL 2003 and news articles from Reuters and Bloomberg) and demonstrate an improvement of about 7 percentage points in entity-level F1 scores compared to an out-of-domain neural NER model.
-
Jang, Youngsoo; Lee, Jongmin; Park, Jaeyoung; Lee, Kyeng-Hun; Lison, Pierre & Kim, Kee-Eung (2019). PyOpenDial: A Python-based Domain-Independent Toolkit for Developing Spoken Dialogue Systems with Probabilistic Rules, In Ruihong Huang & Sebastian Padó (ed.),
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019), System Demonstrations.
Association for Computational Linguistics.
ISBN 9781950737925.
System demonstrations.
pp. 187-192
We present PyOpenDial, a Python-based domain-independent, open-source toolkit for spoken dialogue systems. Recent advances in core components of dialogue systems, such as speech recognition, language understanding, dialogue management, and language generation, harness deep learning to achieve state-of-the-art performance. The original OpenDial, implemented in Java, provides a plugin architecture to integrate external modules, but lacks Python bindings, making it difficult to interface with popular deep learning frameworks such as TensorFlow or PyTorch. To this end, we re-implemented OpenDial in Python and extended the toolkit with a number of novel functionalities for neural dialogue state tracking and action planning. We describe the overall architecture and its extensions, and illustrate their use on an example where the system response model is implemented with a recurrent neural network.
-
Lison, Pierre; Tiedemann, Jörg & Kouylekov, Milen (2018). OpenSubtitles 2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora, In Nicoletta Calzolari; Khalid Choukri; Christopher Cieri; Thierry Declerck; Sara Goggi; Koiti Hasida; Hitoshi Isahara; Bente Maegaard; Joseph Mariani; Hélène Mazo; Asuncion Moreno; Jan Odijk; Stelios Piperidis & Takenobu Tokunaga (ed.),
Proceedings of the Eleventh International Conference on Language Resources and Evaluation.
European Language Resources Association.
ISBN 979-10-95546-00-9.
papers.
pp. 1742-1748
Movie and TV subtitles are a highly valuable resource for the compilation of parallel corpora thanks to their availability in large numbers and across many languages. However, the quality of the resulting sentence alignments is often lower than for other parallel corpora. This paper presents a new major release of the OpenSubtitles collection of parallel corpora, which is extracted from a total of 3.7 million subtitles spread over 60 languages. In addition to a substantial increase in the corpus size (about 30 % compared to the previous version), this new release associates explicit quality scores to each sentence alignment. These scores are determined by a statistical regression model based on simple language-independent features and estimated on a small sample of aligned sentence pairs. Evaluation results show that the model is able to predict lexical translation probabilities with a root mean square error of 0.07 (coefficient of determination R² = 0.47). Based on the scores produced by this regression model, the parallel corpora can be filtered to prune out alignments with a score below a given threshold.
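The threshold-based filtering mentioned at the end of this summary can be sketched in a few lines. This is only an illustration: the tuple layout, the function name `filter_alignments` and the sample scores are assumptions made for the example, not the format or values of the actual OpenSubtitles release.

```python
from typing import List, Tuple

# Hypothetical layout: (source sentence, target sentence, quality score)
ScoredAlignment = Tuple[str, str, float]


def filter_alignments(
    alignments: List[ScoredAlignment], threshold: float
) -> List[ScoredAlignment]:
    """Keep only the alignments whose score is at or above the threshold."""
    return [pair for pair in alignments if pair[2] >= threshold]


# Made-up sample pairs with made-up scores, for illustration only
sample = [
    ("Hello.", "Bonjour.", 0.92),
    ("See you later.", "Merci beaucoup.", 0.08),
    ("Thank you.", "Merci.", 0.75),
]

kept = filter_alignments(sample, threshold=0.5)
for source, target, score in kept:
    print(f"{score:.2f}\t{source}\t{target}")
```

A higher threshold trades corpus size for alignment quality, so the right value depends on the downstream task.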
-
Lison, Pierre & Dogruöz, A. Seza (2018). Detecting Machine-translated Documents in Large Parallel Corpora, In Reinhard Rapp; Pierre Zweigenbaum & Serge Sharoff (ed.),
11th Workshop on Building and Using Comparable Corpora (BUCC 2018).
European Language Resources Association.
ISBN 979-10-95546-07-8.
papers.
pp. 25-32
Parallel corpora extracted from online repositories of movie and TV subtitles are employed in a wide range of NLP applications, from language modelling to machine translation and dialogue systems. However, the subtitles uploaded in such repositories exhibit varying levels of quality. A particularly difficult problem stems from the fact that a substantial number of these subtitles are not written by human subtitlers but are simply generated through the use of online translation engines. This paper investigates whether these machine-generated subtitles can be detected automatically using a combination of linguistic and extra-linguistic features. We show that a feedforward neural network trained on a small dataset of subtitles can detect machine-generated subtitles with an F1-score of 0.64. Furthermore, applying this detection model on an unlabelled sample of subtitles allows us to provide a statistical estimate for the proportion of subtitles that are machine-translated (or are at least of very low quality) in the full corpus.
-
Lison, Pierre & Bibauw, Serge (2017). Not All Dialogues are Created Equal: Instance Weighting for Neural Conversational Models, In David DeVault & Annie Louis (ed.),
18th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL 2017).
Association for Computational Linguistics.
ISBN 978-1-945626-82-1.
Proceedings.
pp. 384-394
Neural conversational models require substantial amounts of dialogue data for their parameter estimation and are therefore usually learned on large corpora such as chat forums or movie subtitles. These corpora are, however, often challenging to work with, notably due to their frequent lack of turn segmentation and the presence of multiple references external to the dialogue itself. This paper shows that these challenges can be mitigated by adding a weighting model into the architecture. The weighting model, which is itself estimated from dialogue data, associates each training example to a numerical weight that reflects its intrinsic quality for dialogue modelling. At training time, these sample weights are included into the empirical loss to be minimised. Evaluation results on retrieval-based models trained on movie and TV subtitles demonstrate that the inclusion of such a weighting model improves the model performance on unsupervised metrics.
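As a rough illustration of how sample weights can enter an empirical loss, the sketch below computes a weighted mean of per-example losses. The helper name `weighted_empirical_loss`, the normalisation by the sum of the weights and the numbers used are assumptions for this example; they are not taken from the paper.

```python
from typing import Sequence


def weighted_empirical_loss(
    losses: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted mean of per-example losses (hypothetical helper).

    Each training example contributes in proportion to its weight, so
    low-quality examples influence the objective less.
    """
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("the weights must sum to a positive value")
    return sum(w * l for w, l in zip(weights, losses)) / total_weight


# Made-up per-example losses and weights, for illustration only
losses = [0.2, 1.5, 0.4]
weights = [1.0, 0.1, 0.5]

print(weighted_empirical_loss(losses, weights))          # weighted objective
print(weighted_empirical_loss(losses, [1.0, 1.0, 1.0]))  # plain mean
```

With uniform weights the function reduces to the ordinary mean loss, which makes the effect of the weighting easy to compare.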
-
Lison, Pierre & Kennington, Casey (2017). Incremental Processing for Neural Conversational Models. SemDial Proceedings.
ISSN 2308-2275.
pp. 162-163
We present a simple approach to adapt neural conversation models to incremental processing. The approach is validated with a proof-of-concept experiment in a visual reference resolution task.
-
Lison, Pierre & Kutuzov, Andrei (2017). Redefining Context Windows for Word Embedding Models: An Experimental Study, In Jörg Tiedemann (ed.),
Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa).
Linköping University Electronic Press.
ISBN 978-91-7685-601-7.
chapter.
pp. 284-288
Distributional semantic models learn vector representations of words through the contexts they occur in. Although the choice of context (which often takes the form of a sliding window) has a direct influence on the resulting embeddings, the exact role of this model component is still not fully understood. This paper presents a systematic analysis of context windows based on a set of four distinct hyperparameters. We train continuous Skip-Gram models on two English-language corpora for various combinations of these hyperparameters, and evaluate them on both lexical similarity and analogy tasks. Notable experimental results are the positive impact of cross-sentential contexts and the surprisingly good performance of right-context windows.
-
Lison, Pierre & Mavroeidis, Vasileios (2017). Automatic Detection of Malware-Generated Domains with Recurrent Neural Models. Norsk Informasjonssikkerhetskonferanse (NISK).
ISSN 1893-6563.
Modern malware families often rely on domain-generation algorithms (DGAs) to determine rendezvous points to their command-and-control server. Traditional defence strategies (such as blacklisting domains or IP addresses) are inadequate against such techniques due to the large and continuously changing list of domains produced by these algorithms. This paper demonstrates that a machine learning approach based on recurrent neural networks is able to detect domain names generated by DGAs with high precision. The neural models are estimated on a large training set of domains generated by various malware families. Experimental results show that this data-driven approach can detect malware-generated domain names with an F1 score of 0.971. To put it differently, the model can automatically detect 93 % of malware-generated domain names for a false positive rate of 1:100.
-
Lison, Pierre & Mavroeidis, Vasileios (2017). Neural Reputation Models learned from Passive DNS data, In
IEEE Big Data 1st International Workshop on Big Data Analytic for Cyber Crime Investigation and Prevention 2017.
IEEE.
ISBN 978-1-5386-2715-0.
1.
pp. 3662-3671
Blacklists and whitelists are often employed to filter outgoing and incoming traffic on computer networks. One central function of these lists is to mitigate the security risks posed by malware threats by associating a reputation (for instance benign or malicious) to end-point hosts. The creation and maintenance of these lists is a complex and time-consuming process for security experts. As a consequence, blacklists and whitelists are prone to various errors, inconsistencies and omissions, as only a tiny fraction of end-point hosts are effectively covered by the reputation lists. In this paper, we present a machine learning model that is able to automatically detect whether domain names and IP addresses are benign, malicious or sinkholes. The model relies on a deep neural architecture and is trained on a large passive DNS database. Evaluation results demonstrate the effectiveness of the approach, as the model is able to detect malicious DNS records with an F1 score of 0.96. In other words, the model is able to detect 95 % of the malicious hosts with a false positive rate of 1:1000.
-
Dragone, Paolo & Lison, Pierre (2016). Classification and Resolution of Non-Sentential Utterances in Dialogue. Italian Journal of Computational Linguistics.
ISSN 2499-4553.
2(1), pp. 45-62
This article addresses the problems of classification and resolution of non-sentential utterances (NSUs) in dialogue. NSUs are utterances that do not have a complete sentential form but convey a full clausal meaning given the conversational context, such as "To the contrary!" or "How much?". The presented approach builds upon the work of Fernández, Ginzburg, and Lappin (2007), who provide a taxonomy of NSUs divided in 15 classes along with a small annotated corpus extracted from dialogue transcripts. The main part of this article focuses on the automatic classification of NSUs according to these classes. We show that a combination of novel linguistic features and active learning techniques yields a significant improvement in the classification accuracy over the state-of-the-art, and is able to mitigate the scarcity of labelled data. Based on this classifier, the article also presents a novel approach for the semantic resolution of NSUs in context using probabilistic rules.
-
Lison, Pierre & Kennington, Casey (2016). OpenDial: A Toolkit for Developing Spoken Dialogue Systems with Probabilistic Rules, In Sameer Pradhan & Marianna Apidianaki (ed.),
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
Association for Computational Linguistics.
ISBN 978-1-945626-03-6.
1.
pp. 67-72
We present a new release of OpenDial, an open-source toolkit for building and evaluating spoken dialogue systems. The toolkit relies on an information-state architecture where the dialogue state is represented as a Bayesian network and acts as a shared memory for all system modules. The domain models are specified via probabilistic rules encoded in XML. OpenDial has been deployed in several application domains such as human–robot interaction, intelligent tutoring systems and multi-modal in-car driver assistants.
-
Lison, Pierre & Meena, Raveesh (2016). Automatic Turn Segmentation of Movie and TV Subtitles, In Najim Dehak & Pedro Torres-Carrasquillo (ed.),
2016 Spoken Language Technology Workshop.
IEEE.
ISBN 978-1-5090-4902-8.
1.
pp. 245-252
Movie and TV subtitles contain large amounts of conversational material, but lack an explicit turn structure. This paper presents a data-driven approach to the segmentation of subtitles into dialogue turns. Training data is first extracted by aligning subtitles with transcripts in order to obtain speaker labels. This data is then used to build a classifier whose task is to determine whether two consecutive sentences are part of the same dialogue turn. The approach relies on linguistic, visual and timing features extracted from the subtitles themselves and does not require access to the audiovisual material, although speaker diarization can be exploited when audio data is available. The approach also exploits alignments with related subtitles in other languages to further improve the classification performance. The classifier achieves an accuracy of 78% on a held-out test set. A follow-up annotation experiment demonstrates that this task is also difficult for human annotators.
-
Lison, Pierre & Tiedemann, Jörg (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles, In
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).
European Language Resources Association.
ISBN 978-2-9517408-9-1.
1.
pp. 923-929
We present a new major release of the OpenSubtitles collection of parallel corpora. The release is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages. The release also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as the automatic correction of OCR errors and the use of meta-data to estimate the quality of each subtitle and score subtitle pairs.
-
Stoyanchev, Svetlana; Lison, Pierre & Bangalore, Srinivas (2016). Rapid Prototyping of Form-driven Dialogue Systems Using an Open-source Framework, In Raquel Fernández & Wolfgang Minker (ed.),
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue.
Association for Computational Linguistics.
ISBN 978-1-945626-23-4.
1.
pp. 216-219
Most human-machine communication for information access through speech, text and graphical interfaces is mediated by forms, i.e. lists of named fields. However, deploying form-filling dialogue systems remains a challenging task due to the effort and skill required to author such systems. We describe an extension to the OpenDial framework that enables the rapid creation of functional dialogue systems by non-experts. The dialogue designer specifies the slots and their types as input and the tool generates a domain specification that drives a slot-filling dialogue system. The presented approach provides several benefits compared to traditional techniques based on flowcharts, such as the use of probabilistic reasoning and flexible grounding strategies.
-
Dragone, Paolo & Lison, Pierre (2015). An Active Learning Approach to the Classification of Non-Sentential Utterances, In Cristina Bosco; Sara Tonelli & Fabio Massimo Zanzotto (ed.),
Proceedings of the Second Italian Conference on Computational Linguistics.
CLiC-IT 2015.
ISBN 978-88-99200-62-6.
papers.
pp. 115-119
This paper addresses the problem of classification of non-sentential utterances (NSUs). NSUs are utterances that do not have a complete sentential form but convey a full clausal meaning given the dialogue context. We extend the approach of Fernández et al. (2007), who provide a taxonomy of NSUs and a small annotated corpus extracted from dialogue transcripts. This paper demonstrates how the combination of new linguistic features and active learning techniques can mitigate the scarcity of labelled data. The results show a significant improvement in the classification accuracy over the state-of-the-art.
-
Dragone, Paolo & Lison, Pierre (2015). Non-sentential utterances in dialogue: experiments in classification and interpretation. SemDial Proceedings.
ISSN 2308-2275.
pp. 170-172
-
Lison, Pierre (2015). A hybrid approach to dialogue management based on probabilistic rules. Computer Speech and Language.
ISSN 0885-2308.
34(1), pp. 232-255. doi: 10.1016/j.csl.2015.01.001
-
Lison, Pierre & Kennington, Casey (2015). Developing Spoken Dialogue Systems with the OpenDial Toolkit. SemDial Proceedings.
ISSN 2308-2275.
pp. 194-196
-
Kosek, Michal Kajetan & Lison, Pierre (2014). An Intelligent Tutoring System for Learning Chinese with a Cognitive Model of the Learner, In Peppi Taalas (ed.),
Proceedings of EUROCALL 2014.
Research-publishing.net.
ISBN 978-1-908416-19-3.
papers.
pp. 179-184
We present an Intelligent Tutoring System that lets students of Chinese learn words and grammatical constructions. It relies on a Bayesian, linguistically motivated cognitive model that represents the learner’s knowledge. This model is dynamically updated given observations about the learner’s behaviour in the exercises, and employed at runtime to select the exercises that are expected to maximise the learning outcome. Compared with a baseline that randomly chooses exercises at the user’s declared level, the system shows positive effects on users’ assessment of how much they have learnt, which suggests that it leads to enhanced learning.
-
Lison, Pierre (2013). Model-based Bayesian Reinforcement Learning for Dialogue Management. Proceedings of the International Conference on Spoken Language Processing.
ISSN 1990-9772.
-
Lison, Pierre (2013). Towards Online Planning for Dialogue Management with Rich Domain Knowledge, In Joseph Mariani; Laurence Devillers; Martine Garnier-Rizet & Sophie Rosset (ed.),
Natural Interaction with Robots, Knowbots and Smartphones - Putting Spoken Dialog Systems into practice.
Springer.
ISBN 978-1-4614-8279-6.
proceedings.
-
Lison, Pierre (2012). Declarative Design of Spoken Dialogue Systems with Probabilistic Rules. SemDial Proceedings.
ISSN 2308-2275.
Spoken dialogue systems are instantiated in complex architectures comprising multiple interconnected components. These architectures often take the form of pipelines whose components are essentially black-boxes developed and optimised separately, using ad-hoc specification formats for their inputs and outputs, domain models and parameters. We present in this paper an alternative modelling approach, in which the dialogue processing steps (from understanding to management and to generation) are all declaratively specified using the same underlying formalism. The formalism is based on probabilistic rules operating on a shared belief state. These rules are expressed as structured mappings between state variables and provide a compact, probabilistic encoding for the dialogue processing models. We argue that this declarative approach yields several advantages in terms of transparency, domain-portability and adaptivity over traditional black-box architectures. We also describe the implementation and validation of this approach in an integrated architecture for human-robot interaction.
-
Lison, Pierre (2012). Probabilistic Dialogue Models with Prior Domain Knowledge, In Gary Geunbae Lee & Jonathan Ginzburg (ed.),
SIGDIAL 2012: Proceedings of 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue.
Association for Computational Linguistics.
ISBN 978-1-937284-44-2.
proceedings.
pp. 179-188
Probabilistic models such as Bayesian Networks are now in widespread use in spoken dialogue systems, but their scalability to complex interaction domains remains a challenge. One central limitation is that the state space of such models grows exponentially with the problem size, which makes parameter estimation increasingly difficult, especially for domains where only limited training data is available. In this paper, we show how to capture the underlying structure of a dialogue domain in terms of probabilistic rules operating on the dialogue state. The probabilistic rules are associated with a small, compact set of parameters that can be directly estimated from data. We argue that the introduction of this abstraction mechanism yields probabilistic models that are easier to learn and generalise better than their unstructured counterparts. We empirically demonstrate the benefits of such an approach by learning a dialogue policy for a human-robot interaction domain based on a Wizard-of-Oz data set.
-
Lison, Pierre (2012). Towards Dialogue Management in Relational Domains, In Simon Dobnik; Staffan Larsson & Robin Cooper (ed.),
Proceedings of the SLTC Workshop on Action, Perception and Language (APL 2012).
APL workshop organisers.
proceedings.
Traditional approaches to dialogue management rely on a fixed, predefined set of state variables. For many application domains, the dialogue state is however best described in terms of a collection of a varying number of entities and relations holding between them. These entities might correspond to objects, places or persons in the context of the interaction, or represent a set of tasks to perform. Such a formalisation of the state space is well-suited for many domains, but presents some challenges for the standard probabilistic models used in dialogue management, since these models are propositional in nature and thus unable to directly operate on such a state representation. To address this issue, we present an alternative approach based on the use of expressive probabilistic rules that allow for limited forms of universal quantification. These rules take the form of structured mappings between input and output variables, and function as high-level templates for the probability and utility models integrated in the dialogue manager. We present in this abstract the general formalisation of this approach, focusing on the use of universal quantifiers to capture the relational structure of the domain.
-
Lison, Pierre (2012). Towards Online Planning for Dialogue Management with Rich Domain Knowledge, In Joseph Mariani; Laurence Devillers; Martine Garnier-Rizet & Sophie Rosset (ed.),
Proceedings of the Fourth International Workshop on Spoken Dialog Systems (IWSDS 2012): Towards a Natural Interaction with Robots, Knowbots and Smartphones.
IWSDS Program committee.
proceedings.
Most approaches to dialogue management have so far concentrated on offline optimisation techniques, where a dialogue policy is precomputed for all possible situations and then plugged into the dialogue system. This development strategy has however some limitations in terms of domain scalability and adaptivity, since these policies are essentially static and cannot readily accommodate runtime changes in the environment or task dynamics. In this paper, we follow an alternative approach based on online planning. To ensure that the planning algorithm remains tractable over longer horizons, the presented method relies on probabilistic models expressed via probabilistic rules that capture the internal structure of the domain using high-level representations. We describe in this paper the generic planning algorithm, ongoing implementation efforts and directions for future work.
-
Lison, Pierre (2011). Multi-Policy Dialogue Management, In David Traum & Johanna Moore (ed.),
Proceedings of the 2011 SIGDIAL Conference.
Association for Computational Linguistics.
ISBN 978-1-937284-10-7.
papers.
pp. 294-300
We present a new approach to dialogue management based on the use of multiple, interconnected policies. Instead of capturing the complexity of the interaction in a single large policy, the dialogue manager operates with a collection of small local policies combined concurrently and hierarchically. The meta-control of these policies relies on an activation vector updated before and after each turn.
-
Kruijff, Geert-Jan; Janiček, Miroslav & Lison, Pierre (2010). Continual processing of situated dialogue in human-robot collaborative activities, In Carlo Alberto Avizzano & Emanuele Ruffaldi (ed.),
Robot and Human Interactive Communication, IEEE International Symposium (RO-MAN 2010).
IEEE Press.
ISBN 9781424479917.
1.
-
Kruijff, Geert-Jan; Lison, Pierre; Benjamin, Trevor; Jacobsson, Henrik; Zender, Hendrik & Kruijff-Korbayová, Ivana (2010). Situated Dialogue Processing for Human-Robot Interaction, In Henrik Christensen; Geert-Jan Kruijff & Jeremy Wyatt (ed.),
Cognitive Systems (Cognitive Systems Monographs).
Springer.
ISBN 3642116930.
8.
-
Lison, Pierre (2010). A salience-driven approach to speech recognition for human-robot interaction, In Thomas Icard & Reinhard Muskens (ed.),
Interfaces: Explorations in Logic, Language and Computation.
Springer.
ISBN 978-3-642-14728-9.
1.
-
Lison, Pierre (2010). Towards Relational POMDPs for Adaptive Dialogue Management, In Nils Reiter; Jan Raab & Seniz Demir (ed.),
Proceedings of the Student Research Workshop of the 48th Annual Meeting of the Association for Computational Linguistics.
Association for Computational Linguistics.
ISBN 978-1-932432-67-1.
1.
-
Lison, Pierre; Ehrler, Carsten & Kruijff, Geert-Jan (2010). Belief modelling for situation awareness in human-robot interaction, In Carlo Alberto Avizzano & Emanuele Ruffaldi (ed.),
Robot and Human Interactive Communication, IEEE International Symposium (RO-MAN 2010).
IEEE Press.
ISBN 9781424479917.
1.
-
Lison, Pierre & Kruijff, Geert-Jan (2010). Policy activation for open-ended dialogue management, In Dan Bohus; Eric Horvitz; Takayuki Kanda; Bilge Mutlu & Antoine Raux (ed.),
Dialog with Robots: Papers from the AAAI Fall Symposium.
AAAI Press.
ISBN 978-1-57735-487-1.
1.
-
Wyatt, Jeremy; Aydemir, Alper; Brenner, Michael; Hanheide, Marc; Hawes, Nick; Jensfelt, Patric; Kruijff, Geert-Jan; Kristan, Matej; Lison, Pierre; Pronobis, Andrzej; Sjöö, Kristoffer; Skočaj, Danijel; Vrečko, Alen; Zillich, Michael & Zender, Hendrik (2010). Self-Understanding and Self-Extension: A Systems and Representational Approach. IEEE Transactions on Autonomous Mental Development.
ISSN 1943-0604.
2(4), pp. 282-303
-
Lison, Pierre (2009). Robust processing of situated spoken dialogue, In Christian Chiarcos; Richard Eckart de Castilho & Manfred Stede (ed.),
Von der Form zur Bedeutung : Texte automatisch verarbeiten / From Form to Meaning : Processing Texts Automatically.
Gunter Narr Verlag.
ISBN 3823365118.
1.
-
Lison, Pierre & Kruijff, Geert-Jan (2009). Efficient parsing of spoken inputs for human-robot interaction, In Takanori Shibata (ed.),
Robot and Human Interactive Communication, IEEE International Symposium (RO-MAN 2009).
IEEE Press.
ISBN 9781424450800.
1.
-
Lison, Pierre & Kruijff, Geert-Jan (2009). Robust processing of situated spoken dialogue, In Bärbel Mertsching; Markus Hund & Zaheer Aziz (ed.),
KI 2009: Advances in Artificial Intelligence.
Springer.
ISBN 978-3-642-04616-2.
1.
-
Lison, Pierre & Kruijff, Geert-Jan (2008). Salience-driven contextual priming of speech recognition for human-robot interaction, In Malik Ghallab; Constantine Spyropoulos; Nikos Fakotakis & Nikolaos Avouris (ed.),
ECAI 2008 - 18th European Conference on Artificial Intelligence.
IOS Press.
ISBN 978-1-58603-891-5.
1.
-
Lison, Pierre; Nilsson, Mattias & Recasens, Marta (ed.) (2012). Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics.
Association for Computational Linguistics.
ISBN 978-1-937284-19-0.
101 pp.
-
Lison, Pierre (2010). Robust Processing of Spoken Situated Dialogue - A Study in Human-Robot Interaction.
Diplomica Verlag.
ISBN 3836691132.
202 pp.
-
Lison, Pierre (2020). Developing NLP models without labelled data using weak supervision.
-
Lison, Pierre (2020). Named Entity Recognition without Labelled Data: A Weak Supervision Approach.
-
Lison, Pierre & Falkum, Ingrid Lossius (2020). Kan kunstig intelligens "forstå" språk?. Aftenposten (morgenutg. : trykt utg.).
ISSN 0804-3116.
-
Løland, Anders & Lison, Pierre (2020, 23 September). Episode 5: Hva er språkteknologi (eller NLP)? Med Pierre Lison. [Internet].
Sannsynligvis VIKTIG (podcast).
-
Løland, Anders; Lison, Pierre & Falkum, Ingrid Lossius (2020, 26 September). Episode 6: Kan språkteknologi virkelig forstå språk? Med Ingrid Lossius Falkum og Pierre Lison. [Internet].
Sannsynligvis VIKTIG (podcast).
-
Riegler, Michael; Lison, Pierre; Strümke, Inga & Løland, Anders (2020). For enkelt om kunstig intelligens: – Diskriminerende og fordomsfull AI er ikke alltid lett å løse. Forskning.no.
ISSN 1891-635X.
-
Lison, Pierre (2019). Data-driven models of reputation for cybersecurity.
-
Lison, Pierre (2019). Dialogue Modelling: Small data, Big data.
-
Lison, Pierre (2019). Modellering av omdømme i cybersikkerhet med nevralske nettverk.
-
Lison, Pierre (2019). Modélisation du dialogue: contrôle du dialogue et corpus multilingues.
-
Lison, Pierre (2019). Open challenges in anonymisation.
-
Prévot, Laurent; Magistry, Pierre & Lison, Pierre (2019). Should we use movie subtitles to study linguistic patterns of conversational speech? A study based on French, English and Taiwan Mandarin.
-
Lison, Pierre (2018). Anonymisering av rettsavgjørelser. NR-notat. SAMBA/07/18.
-
Lison, Pierre (2018). Data-driven models of reputation in cyber-security.
In this talk, I will present our work on developing data-driven, predictive models of reputation (such as benign or malicious) for end-point hosts. I'll focus on two particular questions: 1) Malware often relies on so-called domain-generation algorithms (DGAs) to produce "fake" domain names that are used to connect compromised hosts with a command-and-control server. Many types of DGAs have been developed, from simple hashing techniques to more sophisticated approaches based on wordlists. I will show that these malware-generated domain names can be detected through recurrent neural networks such as LSTMs or GRUs. 2) The second part of the talk will focus on neural models of traffic reputation learned from passive DNS data. Passive DNS data are collections of inter-server DNS queries captured by sensors distributed on the network. This data is a goldmine for predicting whether a given domain name or IP address is likely to be benign or malicious. I will describe a deep neural architecture that predicts the reputation of end-point hosts with high accuracy. The neural model is trained on a large passive DNS dataset (745 million entries) and relies on a broad range of features extracted from the DNS graph.
-
Lison, Pierre (2018). Detecting Machine-translated Subtitles in Large Parallel Corpora.
Parallel corpora extracted from online repositories of movie and TV subtitles are employed in a wide range of NLP applications, from language modelling to machine translation and dialogue systems. However, the subtitles uploaded in such repositories exhibit varying levels of quality. A particularly difficult problem stems from the fact that a substantial number of these subtitles are not written by human subtitlers but are simply generated through the use of online translation engines. This paper investigates whether these machine-generated subtitles can be detected automatically using a combination of linguistic and extra-linguistic features. We show that a feedforward neural network trained on a small dataset of subtitles can detect machine-generated subtitles with an F1-score of 0.64. Furthermore, applying this detection model on an unlabelled sample of subtitles allows us to provide a statistical estimate for the proportion of subtitles that are machine-translated (or are at least of very low quality) in the full corpus.
-
Lison, Pierre (2018). Modélisation du dialogue : systèmes de dialogue parlé et corpus multilingues.
-
Lison, Pierre (2018). Neural models for predicting the reputation of end-point hosts.
-
Lison, Pierre (2018). SAFERS: Talegjenkjenning og maskinlæring for nødmeldetjenester.
-
Lison, Pierre (2018). Tekstmining: En kort innføring.
-
Lison, Pierre; Tiedemann, Jörg & Kouylekov, Milen (2018). OpenSubtitles 2018: Statistical Rescoring of Sentence Alignments in Large, Noisy Parallel Corpora.
Movie and TV subtitles are a highly valuable resource for the compilation of parallel corpora thanks to their availability in large numbers and across many languages. However, the quality of the resulting sentence alignments is often lower than for other parallel corpora. This paper presents a new major release of the OpenSubtitles collection of parallel corpora, which is extracted from a total of 3.7 million subtitles spread over 60 languages. In addition to a substantial increase in the corpus size (about 30 % compared to the previous version), this new release associates explicit quality scores to each sentence alignment. These scores are determined by a statistical regression model based on simple language-independent features and estimated on a small sample of aligned sentence pairs. Evaluation results show that the model is able to predict lexical translation probabilities with a root mean square error of 0.07 (coefficient of determination R² = 0.47). Based on the scores produced by this regression model, the parallel corpora can be filtered to prune out alignments with a score below a given threshold.
-
Lison, Pierre (2017). Automatic Detection of Malware-Generated Domains with Recurrent Neural Models.
Modern malware families often rely on domain-generation algorithms (DGAs) to determine rendezvous points to their command-and-control server. Traditional defence strategies (such as blacklisting domains or IP addresses) are inadequate against such techniques due to the large and continuously changing list of domains produced by these algorithms. This paper demonstrates that a machine learning approach based on recurrent neural networks is able to detect domain names generated by DGAs with high precision. The neural models are estimated on a large training set of domains generated by various malware families. Experimental results show that this data-driven approach can detect malware-generated domain names with an F1 score of 0.971. To put it differently, the model can automatically detect 93 % of malware-generated domain names for a false positive rate of 1:100.
-
Lison, Pierre (2017). Neural Reputation Models learned from Passive DNS Data.
-
Lison, Pierre (2017, 25 September). Opptreden i God Morgen Norge (TV2) for å vise Lenny roboten som ble brukt ved Forskningstorget. [TV].
God Morgen Norge (TV2).
-
Lison, Pierre (2017). SAFERS - Speech Analytics for Emergency Response Services. Kan taleteknologi og maskinlæring brukes for å effektivisere nødmeldetjenester?
-
Lison, Pierre & Bibauw, Serge (2017). Not all dialogues are created equal: instance weighting for neural conversational models.
-
Lison, Pierre & Kennington, Casey (2017). Incremental Processing for Neural Conversational Models.
We presented a simple approach to make neural dialogue models 'incremental' - that is, able to operate on incremental units instead of on complete utterances. The model can handle insertions, commit and revoke operations as well as incremental units associated with probabilities. A proof-of-concept experiment on a visual reference resolution task shows the promise of the approach.
-
Lison, Pierre (2016). A short introduction to statistical machine translation.
Machine translation (MT) systems such as Google Translate have become part of our daily life. But how do they work? In this talk, I'll explain how these systems are built. In the first part of my talk, I'll present a general overview of the field and the key ideas driving modern MT systems. In the second part, I'll dig deeper into the statistical techniques used to estimate translation models from data, and discuss some of the current hot topics in the field.
-
Lison, Pierre (2016). Automatic Turn Segmentation for Movie and TV Subtitles.
Movie and TV subtitles contain large amounts of conversational material, but lack an explicit turn structure. This paper presents a data-driven approach to the segmentation of subtitles into dialogue turns. Training data is first extracted by aligning subtitles with transcripts in order to obtain speaker labels. This data is then used to build a classifier whose task is to determine whether two consecutive sentences are part of the same dialogue turn. The approach relies on linguistic, visual and timing features extracted from the subtitles themselves and does not require access to the audiovisual material, although speaker diarization can be exploited when audio data is available. The approach also exploits alignments with related subtitles in other languages to further improve the classification performance. The classifier achieves an accuracy of 78 % on a held-out test set. A follow-up annotation experiment demonstrates that this task is also difficult for human annotators.
-
Lison, Pierre (2016). Dialogue modelling: small data and large data.
-
Lison, Pierre (2016). Hybrid dialogue management + dialogue modelling for MT.
-
Lison, Pierre (2016). OpenDial: A Toolkit for Developing Spoken Dialogue Systems with Probabilistic Rules.
-
Lison, Pierre (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles.
We present a new major release of the OpenSubtitles collection of parallel corpora. The release is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages. The release also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as the automatic correction of OCR errors and the use of meta-data to estimate the quality of each subtitle and score subtitle pairs.
-
Dragone, Paolo & Lison, Pierre (2015). Non-sentential utterances in dialogue: experiments in classification and interpretation.
-
Lison, Pierre (2015). Structured Probabilistic Modelling for Dialogue Management.
-
Lison, Pierre & Kennington, Casey (2015). Developing Spoken Dialogue Systems with the OpenDial toolkit.
-
Kosek, Michal Kajetan & Lison, Pierre (2014). An Intelligent Tutoring System for Learning Chinese with a Cognitive Model of the Learner.
-
Lison, Pierre (2014). Structured Probabilistic Modelling for Dialogue Management.
-
Lison, Pierre (2014). Structured Probabilistic Modelling for Dialogue Management. Series of dissertations submitted to the Faculty of Mathematics and Natural Sciences, University of Oslo. 1452.
This thesis presents a new modelling framework for dialogue management based on the concept of probabilistic rules. Probabilistic rules are defined as if...then...else constructions associating logical conditions on input variables to probabilistic effects over output variables. These rules function as high-level templates for the generation of a directed graphical model. Their expressive power allows them to represent the probabilistic models employed in dialogue management in a compact and efficient manner. As a consequence, they can drastically reduce the amount of interaction data required for parameter estimation as well as enhance the system's ability to generalise over unseen situations. Furthermore, probabilistic rules can also be exploited to encode domain-specific constraints and assumptions into statistical models of dialogue, thereby enabling system designers to incorporate their expert knowledge of the problem structure in a concise and human-readable form. Due to their integration of logical and probabilistic reasoning, we argue that probabilistic rules are particularly well suited to devise hybrid models of dialogue management that can account for both the complexity and uncertainty that characterise many dialogue domains. The thesis also demonstrates how the parameters of probabilistic rules can be efficiently estimated using both supervised and reinforcement learning techniques. In the case of supervised learning, the rule parameters are learned by imitation on the basis of small amounts of Wizard-of-Oz data. Alternatively, rule parameters can also be optimised via trial and error from repeated interactions with a (real or simulated) user. Both learning strategies rely on Bayesian inference to iteratively estimate the parameter values and provide the best fit for the observed interaction data. 
Three consecutive experiments conducted in a human-robot interaction domain attest to the practical viability of the proposed framework and its advantages over traditional approaches. In particular, the empirical results of a user evaluation with 37 participants show that a dialogue manager structured with probabilistic rules outperforms both purely hand-crafted and purely statistical methods on an extensive range of subjective and objective metrics of dialogue quality. The modelling framework presented in this thesis is implemented in a new software toolkit called OpenDial, which is made freely available to the research community and can be used to develop various types of dialogue systems based on probabilistic rules.
-
Lison, Pierre & Meena, Raveesh (2014). Spoken Dialogue Systems: A New Frontier in Human-Computer Interaction. ACM Crossroads.
ISSN 1528-4972.
21(1)
-
Lison, Pierre (2013, 14 May). Dr. Utenlansk.
Dagsavisen.
-
Lison, Pierre (2013). Kan man snakke med en robot?
-
Lison, Pierre (2013). Model-based Bayesian Reinforcement Learning for Dialogue Management.
-
Lison, Pierre (2012). An Introduction to Machine Learning.
-
Lison, Pierre (2012). Dialogue Management with Probabilistic Rules.
-
Lison, Pierre (2012, 27 September). Kan roboter lære seg selv å snakke med mennesker? [TV].
NRK.
-
Lison, Pierre (2012). Lenny, en robot som lærer å snakke.
-
Lison, Pierre (2012). Probabilistic Dialogue Models with Prior Domain Knowledge.
-
Lison, Pierre (2012). Social Robotics.
-
Lison, Pierre (2012). openDial: A toolkit for building dialogue systems based on probabilistic rules.
-
Lison, Pierre (2012). openDial: a dialogue systems toolkit based on probabilistic rules.
-
Lison, Pierre; Baumann, Timo; Friedberg, Heather; Götze, Jana; Janarthanam, Srini; Lorenzo, Alejandra & Meena, Raveesh (ed.) (2012). Proceedings of the 8th Young Researchers' Roundtable on Spoken Dialogue Systems (YRRSDS 2012).
-
Lison, Pierre (2011). Multi-Policy Dialogue Management.
We present a new approach to dialogue management based on the use of multiple, interconnected policies. Instead of capturing the complexity of the interaction in a single large policy, the dialogue manager operates with a collection of small local policies combined concurrently and hierarchically. The meta-control of these policies relies on an activation vector updated before and after each turn.
-
Lison, Pierre (2011). Multi-policy Dialogue Management.
-
Lison, Pierre (2010). Towards relational POMDPs for adaptive dialogue management.
Published Nov. 27, 2019 5:06 PM
- Last modified Nov. 27, 2019 5:06 PM