Multi-label Sentence Classification for Technical Documents
SIRIUS (Center for Scalable Data Access in the Oil & Gas domain) is a Center for Research-Driven Innovation hosted by the University of Oslo. It involves both academic teams (UiO, NTNU and Oxford University) and industrial partners (Statoil, IBM, Schlumberger). The center's main goal is to develop novel technologies that improve our ability to extract and exploit information from large data stores in the oil and gas domain. Its research and technology development covers several distinct but complementary strands: high-performance computing, cloud computing, database technology, semantic technology and language technology. Together these are intended to assist both well exploration and in-field operational activities in the Oil & Gas domain. This Master project is devoted to research in the language technology strand.
The task of the exploration department at Statoil is to find exploitable deposits of hydrocarbons (oil or gas). Geoscientists in the exploration department model the subsurface geography by classifying rock layers according to multiple stratigraphic hierarchies, using information from a wide range of sources. The quality of the analysis depends on the availability of the relevant data and on how easily it can be accessed. Previous technical studies, reports and surveys are crucial resources in this process. The main objective of the thesis is to implement a multi-label sentence classification model that identifies various types of geological properties in the exploration text data. We aim to take advantage of recent advances in deep learning techniques, such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, to improve on the current baseline. The available dataset is unbalanced across the properties, so the project also needs to find a way to reduce the impact of this imbalance on the classification model.
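To illustrate the imbalance problem, the sketch below shows one common remedy for multi-label classification: weighting the positive term of the binary cross-entropy loss by the ratio of negative to positive examples for each label, so that rare properties contribute more to the loss. This is only an illustration under stated assumptions, not the method the project prescribes. The toy label matrix, the three label columns and the function names are hypothetical.

```python
import math


def positive_weights(label_matrix):
    """Return one weight per label: (#negatives / #positives).

    Rare labels receive weights above 1, frequent labels below 1.
    """
    n_samples = len(label_matrix)
    n_labels = len(label_matrix[0])
    weights = []
    for j in range(n_labels):
        positives = sum(row[j] for row in label_matrix)
        negatives = n_samples - positives
        # Fall back to a neutral weight if a label never occurs.
        weights.append(negatives / positives if positives else 1.0)
    return weights


def weighted_bce(probabilities, targets, pos_weights, eps=1e-12):
    """Mean weighted binary cross-entropy over the labels of one sentence."""
    total = 0.0
    for p, y, w in zip(probabilities, targets, pos_weights):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)


# Hypothetical toy data: 4 sentences, 3 made-up geological property labels.
# A sentence may carry several labels at once (multi-label setting).
labels = [
    [1, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
]

weights = positive_weights(labels)

# Loss for one sentence whose (hypothetical) predicted probabilities are 0.5
# for every label and whose only true label is the second one.
loss = weighted_bce([0.5, 0.5, 0.5], [0, 1, 0], [1.0, 3.0, 1.0])
```

In a CNN or LSTM classifier, the same per-label weights would typically be passed to the framework's weighted sigmoid cross-entropy loss rather than computed by hand. Resampling the training data is an alternative that could be compared against this approach.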
The candidate should be able to program in Python or Java and be willing to learn deep learning techniques for NLP tasks.