Multi-label Sentence Classification for Technical Documents

SIRIUS (Center for Scalable Data Access in the Oil & Gas domain) is a Center for Research-Driven Innovation hosted by the University of Oslo. It involves both academic teams (UiO, NTNU and Oxford University) and industrial partners (Statoil, IBM, Schlumberger). The center's main goal is to develop novel technologies that improve our ability to extract and exploit information from large data stores in the oil and gas domain. Its research and technology development is organized into a set of distinct but complementary technology strands: high-performance computing, cloud computing, database technology, semantic technology and language technology. Together, these strands support both well exploration and in-field operational activities in the oil and gas domain. This Master's project is devoted to research in the language technology strand.

The task of the exploration department at Statoil is to find exploitable deposits of hydrocarbons (oil or gas). Geoscientists in the exploration department model the subsurface geography by classifying rock layers according to multiple stratigraphic hierarchies, using information from a wide range of sources. The quality of the analysis depends on the availability of the relevant data and on how easily it can be accessed. Previous technical studies, reports and surveys are crucial resources in this process.

The main objective of the thesis is to implement a multi-label sentence classification model that identifies various types of geological properties in the exploration text data. We aim to take advantage of recent advances in deep learning techniques, such as Convolutional Neural Networks (CNN) and Long Short-Term Memory networks (LSTM), to improve on the current baseline. The available dataset is unbalanced across the properties, so the work must also find a way to reduce the impact of this imbalance on the classification model.
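To make the imbalance problem concrete, the sketch below shows one common way to counter it in a multi-label setting: weighting the positive term of a per-label binary cross-entropy loss by the ratio of negative to positive examples. This is only an illustration under stated assumptions, not the method the project prescribes. The toy label matrix, the function names positive_weights and weighted_bce, and the fallback weight of 1.0 are all hypothetical.

```python
import math


def positive_weights(label_rows):
    """Per-label weight = (number of negatives) / (number of positives)."""
    n_samples = len(label_rows)
    n_labels = len(label_rows[0])
    weights = []
    for j in range(n_labels):
        positives = sum(row[j] for row in label_rows)
        negatives = n_samples - positives
        # Assumption: a label with no positive examples falls back to weight 1.0.
        weights.append(negatives / positives if positives else 1.0)
    return weights


def weighted_bce(probs, targets, pos_weights, eps=1e-12):
    """Weighted binary cross-entropy for one sentence, averaged over labels."""
    total = 0.0
    for p, t, w in zip(probs, targets, pos_weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(w * t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)


# Toy data (hypothetical): 4 sentences, 3 labels; label 0 is frequent,
# labels 1 and 2 are rare.
labels = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
weights = positive_weights(labels)

# A sentence whose only true label is the rare label 1, predicted with low
# probability: the weighted loss penalizes this miss more than the plain loss.
probs, targets = [0.5, 0.1, 0.5], [0, 1, 0]
weighted = weighted_bce(probs, targets, weights)
unweighted = weighted_bce(probs, targets, [1.0, 1.0, 1.0])
```

With this weighting, a missed rare label contributes more to the loss than a missed frequent one. Alternatives such as resampling or per-label decision thresholds address the same problem and may suit the data better.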

The candidate should be able to program in Python or Java and be willing to learn deep learning techniques for NLP tasks.

Published Oct. 3, 2017 20:51 - Last modified Oct. 18, 2018 10:26

Scope (credits)

60