Extracting unseen entities for peace science research

Context. Most of the information available about conflicts takes the form of news articles. These articles do not follow a simple template and can convey the same information in linguistically diverse forms. Peace scientists need conflict data in a rigid format to facilitate large-scale comparisons and to feed algorithms tackling problems such as conflict forecasting. Currently, human expert annotators read news articles and translate them into structured conflict information. The Peace Science Infrastructure project was recently launched by PRIO (Peace Research Institute Oslo) in collaboration with LTG to automate this process for the Uppsala Conflict Data Program Georeferenced Event Dataset (UCDP GED).

Research Problem. In this context, information extraction models have to deal with previously unseen names. This raises the question: how well do language models generalize to documents published after their training data was collected? In particular, when a news report contains the first-ever occurrence of a specific entity, retrieval-augmented language models might not be able to provide additional help, since there may be nothing about that entity to retrieve yet. Furthermore, Named Entity Recognition and Classification (NERC) models tend to use lists of known entities, called gazetteers, as a form of distant supervision (Liang et al. 2020). This might further bias the models away from recognizing unseen entities. The same question can also be asked of entity linking models, which tend to rely on entity descriptions to handle unseen entities (Wu et al. 2020).
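One way to make the generalization question concrete is to split documents by publication date and report recall separately for entities that did and did not appear before the cutoff. The sketch below is only an illustration under strong assumptions: the toy documents, the entity names, the cutoff date, and the helper names `split_by_date` and `seen_unseen_recall` are all hypothetical, and entities are matched by exact string, which a real study would likely replace with span-level or linked-entity matching.

```python
from datetime import date

# Hypothetical toy data: (publication date, gold entity names, predicted entity names).
documents = [
    (date(2021, 3, 1), {"Group A", "City X"}, {"Group A", "City X"}),
    (date(2021, 9, 1), {"Group A", "Group B"}, {"Group A", "Group B"}),
    (date(2023, 2, 1), {"Group A", "Group C"}, {"Group A"}),
    (date(2023, 6, 1), {"Group B", "Group D"}, {"Group B", "Group D"}),
]


def split_by_date(docs, cutoff):
    """Temporal split: documents before the cutoff are training data, the rest are test data."""
    train = [d for d in docs if d[0] < cutoff]
    test = [d for d in docs if d[0] >= cutoff]
    return train, test


def seen_unseen_recall(train, test):
    """Recall on test entities, bucketed by whether the name occurs in the training documents."""
    # Every gold entity name observed before the cutoff counts as "seen".
    known = set().union(*(gold for _, gold, _ in train))
    counts = {"seen": [0, 0], "unseen": [0, 0]}  # [correctly predicted, total gold]
    for _, gold, predicted in test:
        for name in gold:
            bucket = "seen" if name in known else "unseen"
            counts[bucket][1] += 1
            counts[bucket][0] += int(name in predicted)
    # None signals an empty bucket rather than a recall of zero.
    return {k: (c / t if t else None) for k, (c, t) in counts.items()}


train, test = split_by_date(documents, date(2022, 1, 1))
recall = seen_unseen_recall(train, test)
print(recall)
```

A gap between the two recall values on real data would suggest that a model leans on memorized names rather than on contextual cues, which is the bias the gazetteer discussion above is concerned with.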

References

Keywords: language technology, information extraction, peace science
Published Oct. 9, 2023 14:43 - Last modified Oct. 9, 2023 14:56

Supervisor(s)

Scope (credits)

60