As-a-Service Approach to Data Cleaning and Transformation of Big Data
Traditionally, data cleaning and transformation processes are specified and triggered by user interaction, e.g., by writing commands in a Python script and then executing it, or by specifying formulas in Excel. Such data transformations consist of steps (or scripts) that are executed statically over input datasets, and the result is returned to the agent who triggered them (e.g., as a table in a tool or a terminal output stream) for further processing.
However, data are nowadays generated and deployed in a distributed setting by multiple parties or devices (e.g., sensors, software products), often at irregular time intervals. These factors make the traditional static processes unsuitable for data processing. The task of this thesis would be to design and implement an event-driven data transformation approach that enables users to automatically deploy transformation processes in a distributed setting ("as-a-Service" over the Internet) and to trigger them on specific events (e.g., when data are uploaded somewhere).
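To make the idea of event-triggered execution concrete, the following is a minimal sketch in Python. It is not part of the thesis or of DataGraft: the `EventBus` class, the `data_uploaded` event name, and the `clean_rows` transformation are all hypothetical, and the in-process bus merely stands in for whatever cloud event source a real prototype would use.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, str]
Handler = Callable[[List[Row]], List[Row]]


class EventBus:
    """Hypothetical in-process stand-in for a cloud event source."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        # Register a transformation to run whenever event_type occurs.
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, rows: List[Row]) -> List[List[Row]]:
        # Run every transformation registered for this event type.
        return [handler(rows) for handler in self._handlers[event_type]]


def clean_rows(rows: List[Row]) -> List[Row]:
    """Illustrative transformation: trim whitespace, drop rows with empty values."""
    cleaned = [{k: v.strip() for k, v in row.items()} for row in rows]
    return [row for row in cleaned if all(row.values())]


bus = EventBus()
bus.subscribe("data_uploaded", clean_rows)  # assumed event name

# Simulate an upload event; the transformation runs without user interaction.
uploaded = [{"sensor": " s1 ", "value": "21.5"}, {"sensor": "s2", "value": "  "}]
results = bus.publish("data_uploaded", uploaded)
print(results[0])
```

The point of the sketch is the inversion of control: the transformation is registered once and then runs whenever a matching event arrives, rather than being launched by a user each time.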
What is the context of the thesis?
The thesis will explore data cleaning and transformation approaches to assess how applicable each of them is in the context of event-driven execution. Furthermore, it will explore the state-of-the-art in various technologies such as Web Services, Docker, and Function-as-a-Service, with the goal of identifying the best mixture of tools and (cloud) services to enable event-driven execution. The thesis will address some or all of the following questions:
- How to set up a transformation procedure so that it can be easily deployed as a service? What kinds of approaches are there and which ones fit?
- Which technologies are applicable in the context of event-driven execution of data transformations?
- How can the process of deployment of data transformations be made to scale (automatically)?
- Is such an approach applicable to stream and/or batch data cleaning and transformation?
- How can the process be made transparent to the user?
What are the practical aspects of the thesis?
The practical side of the thesis will involve creating a prototype for event-driven data transformation. This will mean setting up an automated system to compile data transformations and deploy them on the Cloud. Data transformations will be compiled through the DataGraft platform's Grafterizer tool, which can generate packaged executable JAR or WAR files based on a user-defined data cleaning and transformation pipeline.
Who is the thesis for?
Students passionate about cloud and web-service technologies and eager to extend the state-of-the-art in data cleaning and transformation approaches.