Informative masks for text anonymisation
Many types of text documents, such as medical records, court decisions, or even social media data, are rich in personal (and sometimes sensitive) information. For privacy reasons, it is often useful to anonymise those documents so as to conceal the identity of the individuals mentioned or described in them. This anonymisation process consists of two main steps:
- The first is to detect in the text all personal identifiers (such as person names, addresses, physical characteristics, contact information, dates, locations, workplaces, etc.) that may help re-identify the person.
- The second is to mask or edit those personal identifiers to make it more difficult to re-identify the person in question.
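The two steps above can be illustrated with a minimal sketch. It assumes a toy gazetteer lookup in place of a real detection model, and the names `GAZETTEER`, `detect` and `mask` are hypothetical. "Kari Nordmann" is a placeholder name used only for illustration.

```python
import re

# Hypothetical toy gazetteer standing in for a trained detection model.
GAZETTEER = {"Hundorp": "LOCATION", "Kari Nordmann": "PERSON"}

def detect(text):
    """Step 1: return sorted (start, end, label) spans for known identifiers."""
    spans = []
    for term, label in GAZETTEER.items():
        for match in re.finditer(re.escape(term), text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)

def mask(text, spans):
    """Step 2: replace each span with its category label.

    Assumes the spans are sorted and do not overlap.
    """
    pieces, cursor = [], 0
    for start, end, label in spans:
        pieces.append(text[cursor:start])  # text before the identifier
        pieces.append(f"[{label}]")        # uninformative category mask
        cursor = end
    pieces.append(text[cursor:])           # remaining text after the last span
    return "".join(pieces)

text = "Kari Nordmann lives in Hundorp."
masked = mask(text, detect(text))
print(masked)
```

Replacing every span with a bare category label, as done here, is the simplest form of masking and discards most of the information carried by the original text.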
This master's thesis will focus specifically on the second step: assuming that our anonymisation model has managed to detect a personal identifier (say, the name of a person or a location), how can we replace this text span with a more general expression that conceals the person's identity but nevertheless remains more informative than a simple deletion?
For instance, if a medical record mentions that the patient lives in Hundorp, we might want to replace this text span with a more general mention such as [village in Gudbrandsdalen]. This replacement would need to rely on ontologies like those available on Wikipedia/Wikidata (which would tell us that Hundorp is a village and is located in Gudbrandsdalen). As there will typically be a large number of possible replacements in a given document, the thesis will investigate how to use optimisation techniques to find the "optimal masks" according to two objectives: reducing the risk of re-identification while remaining as informative as possible.
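One possible way to frame this trade-off is sketched below, under strong simplifying assumptions: each candidate mask carries a hand-assigned integer risk score and informativeness score (all values are invented for illustration), the risk of a document is taken to be the sum of the risks of its masks, and an exhaustive search stands in for a proper optimisation technique. The names `CANDIDATES` and `select_masks` are hypothetical, and in practice the candidates would be derived from an ontology rather than written by hand.

```python
from itertools import product

# Hypothetical candidates: identifier -> list of (mask, risk, informativeness).
# All scores are invented for illustration (higher risk = easier to re-identify).
CANDIDATES = {
    "Hundorp": [
        ("[village in Gudbrandsdalen]", 40, 8),
        ("[village in Norway]", 20, 5),
        ("[LOCATION]", 5, 1),
    ],
    "Kari Nordmann": [
        ("[woman]", 10, 3),
        ("[PERSON]", 5, 1),
    ],
}

def select_masks(candidates, risk_budget):
    """Pick one mask per identifier, maximising total informativeness
    while keeping the summed risk within risk_budget.

    Exhaustive search: only feasible for a handful of identifiers.
    Returns None if no combination fits the budget.
    """
    identifiers = list(candidates)
    best, best_info = None, -1
    for combo in product(*(candidates[i] for i in identifiers)):
        risk = sum(option[1] for option in combo)
        info = sum(option[2] for option in combo)
        if risk <= risk_budget and info > best_info:
            best, best_info = combo, info
    if best is None:
        return None
    return {i: option[0] for i, option in zip(identifiers, best)}

chosen = select_masks(CANDIDATES, risk_budget=30)
print(chosen)
```

With a budget of 30, the most specific mask for Hundorp is too risky, so the search falls back to the intermediate one. Since the number of combinations grows exponentially with the number of identifiers, a realistic document would call for a more scalable optimisation method.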
This master's thesis is part of a broader research project on text anonymisation; see CLEANUP.
The topic requires good programming skills in Python and an interest in data privacy. Some prior experience with ontologies and/or optimisation algorithms is an advantage (but not a requirement). The student taking this topic must be enrolled in the M.Sc. in Informatics: Language Technology.