A study of big data computing platforms : performance, fairness and energy consumption
Date of Issue: 2017-09-14
Interdisciplinary Graduate School (IGS)
Today, population aging is a trend spreading across the world. It is estimated that the number of people aged 60 years or over is increasing rapidly. Although the elderly have wisdom and wealth gathered from their life experience, they often require long-term assistance from others. Ambient assisted living (AAL) systems open up a new opportunity to address the needs of the aged by utilizing information and communication technologies. Multiple sensors and actuators coexist in AAL systems and are deployed throughout the environment surrounding the elderly. A huge volume of sensed data is collected continuously from the elderly. Storing this massive raw data, processing it efficiently to infer knowledge, and using that knowledge to guide the actuators to meet the needs of the elderly is the problem we face. In this thesis, we focus on efficient large-scale data processing: we investigate whether and how we can build an effective large-scale data processing system to extract knowledge, wisdom and skills from elderly people to benefit the development of the entire society. Many large-scale big data analytics systems have emerged to process such massive data effectively. Efficient job scheduling and resource management for these data analytics frameworks are nontrivial. Modern job schedulers and resource coordinators in data processing frameworks often need to consider multiple objectives simultaneously because system operators have varied requirements for data analytics. Currently, resource efficiency (throughput), job latency (per-job performance), fairness (isolation guarantee) and energy consumption are important concerns for the job scheduler in modern large-scale multi-tenant environments. Resource efficiency is a de facto key factor for big data analytics frameworks. Job latency reflects the waiting time of the application.
Fairness is a key building block of any multi-tenant computing system that allows resources to be shared effectively. Data center energy usage has reached 3% of global electricity consumption, generating 200 million metric tons of CO2 in 2014. To reduce carbon emissions and the financial burden of electricity, many data centers have been redesigned and are powered by multiple energy sources, including renewable (green) energy from non-polluting sources and brown energy from traditional polluting sources. Improving resource efficiency and reducing per-job latency are widely accepted goals for large-scale data processing frameworks, and many studies have been proposed in these directions. In contrast, the fairness and energy consumption of those frameworks need further exploration, and how resource efficiency, job latency, fairness and energy consumption interact with each other on big data computing frameworks has not been well addressed. In this thesis, we perform a detailed study of the resource efficiency, job latency, fairness and energy consumption of big data processing frameworks and find that these objectives can translate into discordant actions. We propose bi-criteria optimization algorithms and finally implement a general multi-objective optimization system to address the tradeoffs between different objectives.
First, we explore the tradeoff between resource efficiency and fairness in detail and find that it depends on the workload. A scheduler should therefore be aware of both workload dynamics and the efficiency-fairness tradeoff. Since researchers keep inventing new scheduling algorithms, we develop a meta-scheduler called FLEX, which addresses the efficiency-fairness tradeoff by leveraging existing efficiency- and/or fairness-optimized schedulers. FLEX adaptively chooses the most suitable scheduler at runtime according to variations in the workload and user-defined SLAs.
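The idea of a meta-scheduler that picks among existing schedulers under a user-defined bound can be illustrated with a minimal Python sketch. This is not FLEX's actual algorithm, which the abstract does not specify: the names `Estimate`, `choose_scheduler`, `fair_share` and `shortest_first`, the workload representation (a list of job sizes) and the toy cost models are all hypothetical assumptions made for illustration. The sketch selects the candidate with the smallest estimated makespan whose estimated fairness loss stays within the threshold, and falls back to the fairest candidate when none qualifies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Estimate:
    """Predicted outcome of running a workload under one scheduler (hypothetical)."""
    makespan: float
    fairness_loss: float  # 0.0 means perfectly fair


Workload = List[float]  # assumption: a workload is just a list of job sizes
Estimator = Callable[[Workload], Estimate]


def fair_share(workload: Workload) -> Estimate:
    # Toy model (assumption): fair sharing loses no fairness but is slower.
    return Estimate(makespan=sum(workload), fairness_loss=0.0)


def shortest_first(workload: Workload) -> Estimate:
    # Toy model (assumption): 20% faster, but unfairness grows with the
    # relative spread of job sizes.
    spread = (max(workload) - min(workload)) / max(workload)
    return Estimate(makespan=0.8 * sum(workload), fairness_loss=spread)


def choose_scheduler(candidates: Dict[str, Estimator],
                     workload: Workload,
                     max_fairness_loss: float) -> str:
    """Return the name of the candidate to use for the current workload."""
    estimates = {name: est(workload) for name, est in candidates.items()}
    feasible = {name: e for name, e in estimates.items()
                if e.fairness_loss <= max_fairness_loss}
    if feasible:
        # Fastest candidate among those that respect the fairness bound.
        return min(feasible, key=lambda name: feasible[name].makespan)
    # No candidate meets the bound: fall back to the fairest one.
    return min(estimates, key=lambda name: estimates[name].fairness_loss)


CANDIDATES: Dict[str, Estimator] = {
    "fair_share": fair_share,
    "shortest_first": shortest_first,
}
```

Calling `choose_scheduler(CANDIDATES, [1.0, 1.0, 1.0], 0.2)` picks the efficiency-oriented candidate because uniform jobs incur no fairness loss in the toy model, while a skewed workload such as `[1.0, 10.0]` exceeds the bound and switches the choice to the fair candidate. Re-running the selection as the workload changes is what makes the choice adaptive.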
We have implemented FLEX in Hadoop YARN. We conduct experiments with a real deployment in a local cluster and perform simulation studies with production traces. FLEX performs better than the state-of-the-art scheduling algorithm in two aspects: 1) given a predefined threshold on the fairness loss, FLEX reduces the makespan by up to 22% and 24% in real deployment and large-scale simulation, respectively; 2) given a predefined threshold on the makespan reduction, it reduces the fairness loss by up to 75% and 73% in real deployment and large-scale simulation, respectively.
Second, we study the tradeoff between resource efficiency and energy consumption in modern data centers, which have been redesigned to draw power intelligently from multiple sources: green energy from renewable, non-polluting sources, and brown energy drawn from the electric grid when renewable energy is insufficient. We find that not all joules are equal, in the sense that the amount of work that can be done by a joule can vary significantly. We investigate how to exploit such joule efficiency to maximize the benefits of renewable energy and dynamic pricing for the MapReduce framework while satisfying workload deadlines. We have developed JouleMR, a cost-effective and green-aware MapReduce framework. We implement JouleMR on top of Hadoop YARN and evaluate it both in a real local cluster and in a large-scale simulator. JouleMR significantly reduces brown energy consumption in both real experiments and simulations (by up to 35% and 28%, respectively, compared with state-of-the-art systems). Additionally, JouleMR reduces the electricity cost in both real experiments and simulations compared with state-of-the-art work (by 30% and 36%, respectively).
Third, we further explore the relationship between multiple objectives and develop an efficient framework for multi-objective optimization on geo-distributed data analytics systems.
We observe that these objectives can translate into discordant actions and that their relationship is affected by the unique features of geo-distributed data analytics systems. We formulate the multi-objective optimization problem mathematically and propose an efficient online heuristic algorithm to solve it for geo-distributed data analytics systems. We develop GeoSpark, an extension to Spark, which automatically performs multi-objective optimization according to system operators’ preferences on different objectives. GeoSpark achieves up to 30% makespan reduction, 28% job latency reduction and a better fairness guarantee compared with existing schedulers in Apache Spark in the geo-distributed setting.
In summary, this thesis studies the resource efficiency, job latency, fairness and energy consumption of job schedulers, and the relationships among these objectives, for large-scale data computing frameworks. We propose bi-criteria optimization algorithms to address the tradeoffs between different objectives and finally develop a general multi-objective optimization system.
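One common way to turn operator preferences over several objectives into a single decision is weighted-sum scalarization, sketched below in Python. The abstract does not describe GeoSpark's heuristic, so this is a swapped-in textbook technique rather than the thesis's method; the function names `normalize` and `pick_plan`, the objective keys and the example numbers in `PLANS` are hypothetical. Each objective is min-max scaled across the candidate plans so that different units become comparable, and the plan with the lowest weighted score is chosen (lower is better for every objective here).

```python
from typing import Dict, List

# Objective names follow the abstract; all are "lower is better" in this sketch.
OBJECTIVES = ("makespan", "job_latency", "fairness_loss", "energy")


def normalize(plans: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Min-max scale every objective to [0, 1] across the candidate plans."""
    scaled: List[Dict[str, float]] = [dict() for _ in plans]
    for obj in OBJECTIVES:
        values = [plan[obj] for plan in plans]
        lo, hi = min(values), max(values)
        span = hi - lo
        for target, value in zip(scaled, values):
            # If all plans agree on an objective, it cannot affect the choice.
            target[obj] = 0.0 if span == 0 else (value - lo) / span
    return scaled


def pick_plan(plans: List[Dict[str, float]], weights: Dict[str, float]) -> int:
    """Return the index of the plan with the lowest weighted score."""
    scaled = normalize(plans)
    scores = [sum(weights.get(obj, 0.0) * s[obj] for obj in OBJECTIVES)
              for s in scaled]
    return min(range(len(plans)), key=scores.__getitem__)


# Made-up candidate plans, for illustration only.
PLANS = [
    {"makespan": 100.0, "job_latency": 20.0, "fairness_loss": 0.4, "energy": 50.0},
    {"makespan": 130.0, "job_latency": 25.0, "fairness_loss": 0.1, "energy": 40.0},
]
```

With these made-up plans, an operator who weights only makespan (`{"makespan": 1.0}`) gets plan 0, while one who weights only fairness (`{"fairness_loss": 1.0}`) gets plan 1. A weighted sum is simple but can miss solutions on non-convex parts of the Pareto front, which is one reason a dedicated heuristic may be preferable in practice.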