This thesis topic is no longer available

Reference resolution for court cases

Court cases typically include many references to named entities such as person names (victims, defendants, witnesses and various third parties), locations and organisations. To protect the privacy of the individuals referred to in a court case, the document often needs to be de-identified, which means that specific categories of entities need to be replaced by coded values. For instance, the actual person names can be replaced by "A", "B" and "C".

This editing process is done in several stages. A (neural) de-identification model is first run on the text to detect entity mentions that should be edited. A human expert then reviews those suggestions and decides how to replace them with coded values.
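The last stage, substituting coded values for reviewed mentions, can be sketched as follows. This is only a minimal illustration: the `Mention` class, the `apply_replacements` function, the character-offset representation and the example sentence with its names are all assumptions made for this sketch, not part of the actual de-identification pipeline. It also assumes that mention spans do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A reviewed entity mention (hypothetical structure for this sketch)."""
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    code: str   # coded value chosen by the human reviewer


def apply_replacements(text: str, mentions: list[Mention]) -> str:
    """Replace each mention span with its coded value.

    Spans are processed from right to left so that the offsets of
    mentions earlier in the text stay valid after each substitution.
    Assumes the spans do not overlap.
    """
    result = text
    for m in sorted(mentions, key=lambda m: m.start, reverse=True):
        result = result[:m.start] + m.code + result[m.end:]
    return result


# Invented example sentence; the names are not taken from any real case.
text = "Kari Hansen met Ola Berg in Bergen."
mentions = [Mention(0, 11, "A"), Mention(16, 24, "B")]
deidentified = apply_replacements(text, mentions)
print(deidentified)
```

Working from the end of the text backwards avoids having to recompute offsets when a coded value is shorter or longer than the mention it replaces.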

This master's thesis topic takes a closer look at the reference resolution problem associated with de-identification. Given a set of detected entity mentions, we need to find out which ones refer to the same real-world entity. For instance, mentions such as "University of Oslo", "Universitetet i Oslo", "UiO", "Universitetet" and "Univ. of Oslo" all refer to the same entity, which means they should be associated with the same coded value (such as "[institution1]"). The thesis will investigate how to solve this reference resolution problem in a robust manner, taking into account the document content as well as external resources (such as knowledge bases).
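To make the task concrete, here is a deliberately naive baseline sketch: an alias-table lookup that maps each mention to a canonical entity and gives every entity one coded value. It is not the method the thesis is expected to develop. The `ALIASES` table is a hand-written stand-in for a knowledge base, the function names are hypothetical, and "NTNU" is added only to show a second entity.

```python
def normalise(mention: str) -> str:
    """Lowercase, drop full stops and collapse whitespace."""
    return " ".join(mention.lower().replace(".", "").split())


# Hypothetical alias table standing in for a knowledge base lookup.
# Mapping the bare "universitetet" to one institution is only safe when
# the document context supports it; a robust system would need to check.
ALIASES = {
    "universitetet i oslo": "university of oslo",
    "uio": "university of oslo",
    "universitetet": "university of oslo",
    "univ of oslo": "university of oslo",
}


def assign_codes(mentions: list[str], prefix: str = "institution") -> dict[str, str]:
    """Map each mention to a coded value shared by all mentions of one entity.

    Codes are numbered in order of first appearance of each entity.
    """
    entity_codes: dict[str, str] = {}
    result: dict[str, str] = {}
    for mention in mentions:
        key = normalise(mention)
        entity = ALIASES.get(key, key)  # unknown mentions form their own entity
        if entity not in entity_codes:
            entity_codes[entity] = f"[{prefix}{len(entity_codes) + 1}]"
        result[mention] = entity_codes[entity]
    return result


mentions = [
    "University of Oslo",
    "Universitetet i Oslo",
    "UiO",
    "Universitetet",
    "Univ. of Oslo",
    "NTNU",  # extra mention added for this sketch only
]
codes = assign_codes(mentions)
for mention, code in codes.items():
    print(f"{mention} -> {code}")
```

A fixed table like this breaks as soon as a mention is misspelled, abbreviated in a new way, or ambiguous between two entities, which is why the document content and external resources need to be taken into account.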

In practice, the thesis will use a dataset of synthetic court cases generated from actual court cases from Lovdata. It will be carried out in close collaboration with Lovdata, and Atle Oftedahl (a developer and ML expert at Lovdata) will co-supervise it. The other supervisor will be Pierre Lison, Lilja Øvrelid or Ildikó Pilán, depending on supervision capacity.

Prerequisites: Good programming skills and experience developing and evaluating NLP models. Good knowledge of Norwegian (Bokmål and Nynorsk).

Keywords: natural language processing, language technology, reference resolution, machine learning
Published 15 Oct. 2020 10:29 - Last modified 4 Feb. 2021 16:21

Student(s)

  • Torbjørn Dahl

Scope (credits)

60