Topic Segmentation and Term Disambiguation for NAV
As part of the services provided to the public by the Norwegian Labour and Welfare Administration (NAV) there is a need to extract structured information from job advertisements that are posted in free-text form. Currently, NAV applies manual, human coding to categorize job advertisements according to the type of position and the employer. This is a costly and error-prone process, and it captures only a narrow range of the structured information one would want about an available position, such as specific requirements on the training, competencies, and expertise of candidate employees.
Since around 2002, NAV has curated a database of some 2.5 million textual job advertisements that have been manually categorized. As part of an ongoing internal project on developing a new platform to ‘match’ available positions and job-seekers, the internal Artificial Intelligence Laboratory at NAV seeks to facilitate an MSc project on pre-processing of job advertisements to better support manual categorization and, if possible, enable partial or full automation of the process of extracting relevant information from these documents. Abstractly, this part of the project could be approached as an instance of topic segmentation and topic modeling.
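To make the notion of topic segmentation concrete, the following is a minimal sketch of a lexical-cohesion segmenter, loosely in the spirit of TextTiling-style approaches. It is not a method prescribed by NAV or by this project: the function names, the bag-of-words representation, and the threshold value are all assumptions chosen for illustration. Adjacent sentences are compared by cosine similarity, and a segment boundary is placed wherever the similarity drops below the threshold.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased word counts for one sentence (illustrative tokenization)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def segment(sentences, threshold=0.1):
    """Group sentences into segments, starting a new segment whenever the
    similarity to the previous sentence falls below the (assumed) threshold."""
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bag_of_words(prev), bag_of_words(cur)) < threshold:
            segments.append([cur])
        else:
            segments[-1].append(cur)
    return segments


# Invented example sentences, not taken from the NAV database.
ad = [
    "We are hiring a nurse.",
    "The nurse will work nights.",
    "Applicants need a driving licence.",
]
for part in segment(ad):
    print(part)
```

A realistic system would likely need stop-word handling, Norwegian-language tokenization, and smoothing over windows of several sentences; the sketch only shows the basic boundary decision.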
Additionally, NAV has available a very large ontology of employment- and job-seeker–related concepts, for example degrees in various subject areas. A subsidiary goal of this MSc project could be to identify textual units in the job advertisement that likely name one of the relevant concepts: a phrase like Bachelor degree in Language Technology, for example, if used as a requirement on applicants in the job advertisement, will likely correspond to several ontology nodes, viz. one for a specific level of higher-education degree and one for a specific area of scientific specialization. There will typically be a potentially large number of different natural language expressions that name the same concept, and not all occurrences of these expressions will necessarily form a relevant unit. This sub-project, time permitting, could be viewed as calling for the combination of general techniques like part-of-speech tagging, named entity recognition, and chunking, as well as adaptation of methods for so-called term disambiguation and grounding (in a domain-specific knowledge base).
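As a rough illustration of what grounding a phrase in an ontology could look like, the sketch below performs greedy longest-match lookup of known surface forms. Everything in it is hypothetical: the node identifiers, the surface forms, and the function names are invented for this example and do not reflect the actual NAV ontology. It also ignores the disambiguation problem entirely, since every matched expression is assumed to be a relevant unit.

```python
import re

# Hypothetical toy ontology: node id -> known surface forms (all invented).
ONTOLOGY = {
    "degree_level/bachelor": ["bachelor degree", "bachelor's degree", "bsc"],
    "subject/language_technology": [
        "language technology",
        "computational linguistics",
    ],
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(ontology):
    """Map each surface form, as a tuple of tokens, to its ontology node id."""
    index = {}
    for node_id, forms in ontology.items():
        for form in forms:
            index[tuple(tokenize(form))] = node_id
    return index


def ground(text, ontology, max_len=4):
    """Scan left to right, preferring the longest token span that matches a
    known surface form; return (matched phrase, node id) pairs in order."""
    index = build_index(ontology)
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in index:
                matches.append((" ".join(key), index[key]))
                i += n
                break
        else:
            i += 1  # no surface form starts at this token
    return matches


print(ground("Bachelor degree in Language Technology required", ONTOLOGY))
```

On the example phrase, the lookup yields two separate nodes, one for the degree level and one for the subject area, which mirrors the one-phrase-to-several-nodes situation described above. Exact string matching of this kind would miss spelling variants and inflected forms, which is where tagging, chunking, and proper term disambiguation would come in.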
The project will be practical and experimental in nature, although some theoretical modeling will likely also be called for, for example in deciding on the topic structure of job advertisements and the type of the units involved. Good programming expertise and at least basic knowledge of natural language processing are prerequisites for this project, for example at the level of INF4820: Algorithms for Artificial Intelligence and Natural Language Processing.