Benchmarking Performance of Data-Intensive Applications on Heterogeneous Platforms
Data-intensive computing, often simply referred to as big data, is one of the major current trends in information and communication technology. In areas as diverse as social media, business intelligence, information security, Internet-of-Things, and scientific research, a tremendous amount of possibly unstructured data is created or collected at a speed surpassing what we can handle using traditional data management techniques.

Data-intensive applications in a cluster have traditionally relied on a co-located big data compute framework and storage system. An example of such a configuration is Hadoop MapReduce running on a co-located Hadoop Distributed File System (HDFS) cluster. MapReduce applications in this example access the shared storage provided by HDFS from the distributed data-processing nodes. However, the big data ecosystem is evolving rapidly, and a whole range of big data processing frameworks, technologies, and storage systems has been developed in the last several years. The current ecosystem involves a variety of compute frameworks, ranging from traditional Hadoop MapReduce and Apache Spark to specialised frameworks such as Apache Flink, Storm, and Samza, to name a few.

The storage systems are equally plentiful. A variety of open source and enterprise distributed file systems, database management systems, and cloud-specific storage technologies are in use. Moreover, modern big data applications often manage multiple data sources, requiring separate management of the namespace and access APIs for each data source. This heterogeneity of storage technologies brings interoperability and performance issues together with costly storage integrations. In addition, unified benchmarks for assessing data access performance are not readily available, which results in poorly informed technology choices for a given use case.
Description of Topic:
In this thesis, we propose to study the impact of distribution and data locality on the performance of data-intensive applications and to design benchmarks that assess the quality of data-processing stacks. The idea is to show the impact of both distribution and data locality on heterogeneous platforms and then potentially take advantage of lineage (through Alluxio) with write caches and intelligent tiered storage to target improvements in data-processing performance. Alluxio is a rapidly growing open source memory-speed virtual distributed storage system that enables big data applications to interact with data from a variety of storage systems and technologies.
The goal of this master's project is to design and develop benchmarks for evaluating the performance of various combinations of data-processing frameworks, distributed file systems, and storage technologies, using Alluxio as a unified virtual distributed storage system.
What you will do:
- You will design a benchmarking methodology for data-intensive applications built on state-of-the-art data-processing frameworks and storage technologies.
- You will set up data-processing stacks and run experiments to identify performance bottlenecks in a variety of application scenarios.
- You will explore hooking Alluxio APIs into the job schedulers at appropriate points to bring data onto a node, cache it in memory (or move it to a desired tier), or persist it back to the source, in order to improve application performance.
- You will potentially publish your results at a good scientific venue!
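To make the benchmarking task above more concrete, here is a minimal sketch of the kind of read-throughput measurement such a methodology might start from. It is an illustration only, not part of the project specification: the class name ReadBenchmark, the DataSource interface, and the warm-up and run counts are all hypothetical, and the example reads a local temporary file rather than a real distributed storage system.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ReadBenchmark {

    /** Hypothetical abstraction: any storage backend that can open a stream over the dataset. */
    public interface DataSource {
        InputStream open() throws IOException;
    }

    /** Reads the whole source once and returns the number of bytes read. */
    public static long drain(DataSource source) throws IOException {
        byte[] buffer = new byte[1 << 16];
        long total = 0;
        try (InputStream in = source.open()) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    /**
     * Runs untimed warm-up passes, then timed passes, and returns the median
     * throughput in MiB/s (the upper median when the run count is even).
     */
    public static double medianThroughputMiBps(DataSource source, int warmups, int runs)
            throws IOException {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        for (int i = 0; i < warmups; i++) {
            drain(source); // warm caches and the JIT before measuring
        }
        double[] samples = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            long bytes = drain(source);
            long elapsedNanos = Math.max(1, System.nanoTime() - start);
            samples[i] = (bytes / (1024.0 * 1024.0)) / (elapsedNanos / 1e9);
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) throws IOException {
        // Stand-in dataset: an 8 MiB local temporary file.
        Path file = Files.createTempFile("bench", ".dat");
        try {
            Files.write(file, new byte[8 * 1024 * 1024]);
            double mibps = medianThroughputMiBps(() -> Files.newInputStream(file), 2, 5);
            System.out.printf("median read throughput: %.1f MiB/s%n", mibps);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Keeping the storage backend behind a small interface means the same measurement loop could, in principle, be pointed at different storage systems for comparison. A real study would also need to control for caching effects, data placement, and cluster load, which this sketch does not attempt.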
What you will learn:
- You will get to know state-of-the-art data-processing frameworks, distributed file systems, and storage technologies.
- You will develop a thorough understanding of workload execution in a distributed data centre.
- You will learn performance analysis and develop insight into algorithm evaluation.
Qualifications:
- First-class programming skills in Java (or C/C++).
- Previous experience and/or knowledge of data-intensive computing and distributed applications is a plus.
For more information, please contact Feroz Zahid (email@example.com) or Silvia Lizeth Tapia Tarifa (firstname.lastname@example.org).
 Spark: Lightning-fast Unified Analytics Engine
 Flink: Stateful Computations over Data Streams
 Storm: Realtime Computations
 Samza: Distributed Stream Processing Framework
 Alluxio: Open Source Memory Speed Virtual Distributed Storage