I'm a PhD student in the Language Technology Group at the University of Oslo, affiliated with the dScience center. My main academic interest is language modeling, particularly how to make large pre-trained language models more efficient and effective. I also like to parse some semantic graphs from time to time.
Some of my recent projects:
Tags: natural language processing, deep learning, language models, semantic parsing
Publications
- Charpentier, Lucas Georges Gabriel & Samuel, David (2023). Not all layers are equally as important: Every Layer Counts BERT. In Warstadt, Alex; Mueller, Aaron; Choshen, Leshem; Wilcox, Ethan; Zhuang, Chengxu; Ciro, Juan; Mosquera, Rafael; Paranjape, Bhargavi; Williams, Adina; Linzen, Tal & Cotterell, Ryan (Eds.), Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pp. 238–252. Association for Computational Linguistics. ISBN 978-1-952148-02-6. doi: 10.18653/v1/2023.conll-babylm.20.
- Samuel, David (2023). Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings. In Warstadt, Alex; Mueller, Aaron; Choshen, Leshem; Wilcox, Ethan; Zhuang, Chengxu; Ciro, Juan; Mosquera, Rafael; Paranjape, Bhargavi; Williams, Adina; Linzen, Tal & Cotterell, Ryan (Eds.), Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pp. 221–237. Association for Computational Linguistics. ISBN 978-1-952148-02-6. doi: 10.18653/v1/2023.conll-babylm.19.
- Samuel, David; Kutuzov, Andrei; Øvrelid, Lilja & Velldal, Erik (2023). Trained on 100 million words and still in shape: BERT meets British National Corpus. In Vlachos, Andreas & Augenstein, Isabelle (Eds.), Findings of the Association for Computational Linguistics: EACL 2023, pp. 1954–1974. Association for Computational Linguistics. ISBN 978-1-959429-47-0.
While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.
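As an illustration of how a BERT-style masked LM such as LTG-BERT is queried, here is a minimal sketch using the HuggingFace transformers API. The model id below is a placeholder assumption rather than an official release name, and because LTG-BERT is a custom architecture, loading it typically requires trust_remote_code=True.

```python
# Minimal sketch: masked-token prediction with a BERT-style LM via transformers.
# "ltg/ltg-bert-bnc" is a hypothetical model id used only for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "ltg/ltg-bert-bnc"  # placeholder; check the ltg organization on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

text = f"The British National Corpus is a {tokenizer.mask_token} of written and spoken English."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Take the most likely token at the [MASK] position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```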
- Samuel, David & Straka, Milan (2021). ÚFAL at MultiLexNorm 2021: Improving Multilingual Lexical Normalization by Fine-tuning ByT5. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pp. 483–492. Association for Computational Linguistics. ISBN 978-1-954085-90-9.
We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 (van der Goot et al., 2021a), which evaluates lexical-normalization systems on 12 social media datasets in 11 languages. We base our solution on a pre-trained byte-level language model, ByT5 (Xue et al., 2021a), which we further pre-train on synthetic data and then fine-tune on authentic normalization data. Our system achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. The source code is released at https://github.com/ufal/multilexnorm2021 and the fine-tuned models at https://huggingface.co/ufal.
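A minimal sketch of the inference side of such a system: a byte-level seq2seq model mapping a noisy sentence to its normalized form. It uses the public google/byt5-small base model purely for illustration; the authors' fine-tuned checkpoints are the ones released at https://huggingface.co/ufal, whose exact model ids are not listed here.

```python
# Minimal sketch: byte-level seq2seq normalization with ByT5 via transformers.
# The base model will not normalize well on its own; after fine-tuning on
# (noisy, normalized) pairs, generate() yields the cleaned text.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/byt5-small"  # public base model; not the fine-tuned ufal release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

noisy = "new pix comming tomoroe"
inputs = tokenizer(noisy, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```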
- Samuel, David & Straka, Milan (2020). ÚFAL at MRP 2020: Permutation-invariant Semantic Parsing in PERIN. In Oepen, Stephan; Abend, Omri; Abzianidze, Lasha; Bos, Johan; Hajic, Jan; Hershcovich, Daniel; Li, Bin; O'Gorman, Tim & Zeman, Daniel (Eds.), Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, pp. 53–64. Association for Computational Linguistics. ISBN 978-1-952148-64-4. doi: 10.18653/v1/2020.conll-shared.5.
We present PERIN, a novel permutation-invariant approach to sentence-to-graph semantic parsing. PERIN is a versatile, cross-framework and language independent architecture for universal modeling of semantic structures. Our system participated in the CoNLL 2020 shared task, Cross-Framework Meaning Representation Parsing (MRP 2020), where it was evaluated on five different frameworks (AMR, DRG, EDS, PTG and UCCA) across four languages. PERIN was one of the winners of the shared task. The source code and pretrained models are available at http://www.github.com/ufal/perin.
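To make the "permutation-invariant" idea concrete: predicted graph nodes carry no inherent order, so predictions are aligned to the gold nodes with an optimal bipartite matching before a loss is computed. The toy sketch below shows such a matching with the Hungarian algorithm; it is a generic illustration of the principle, not necessarily how PERIN implements its training objective.

```python
# Toy sketch of permutation-invariant node matching with the Hungarian algorithm.
# cost[i, j] is an assumed cost of assigning predicted node i to gold node j
# (e.g. a negative label log-probability plus an anchoring distance).
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.6],
    [0.8, 0.9, 0.3],
])

pred_idx, gold_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gold_idx)))   # optimal pairing of predicted and gold nodes
print(cost[pred_idx, gold_idx].sum())  # total cost of the optimal matching (0.6 here)
```

Because the loss is computed on the matched pairs, permuting the order of the predicted nodes leaves the objective unchanged.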
View all works in Cristin
Published Sep. 30, 2021 5:32 PM. Last modified Oct. 8, 2023 11:06 PM.