High performance database systems on coupled CPU-GPU architectures
Date of Issue: 2016
School of Computer Engineering
Parallel and Distributed Computing Centre
Database systems are widely used across a large range of applications to let users store, modify and extract information from huge volumes of data. In recent years, with constantly increasing data volumes and the emergence of real-time data analytics workloads such as decision support systems, the demand for high-performance query execution has become more intense than ever before. The database community has devoted considerable effort to improving query execution performance from various angles. Among these efforts, exploiting emerging hardware has become an effective and efficient approach. Graphics Processing Units (GPUs) were originally designed for graphics workloads. In recent years, programming GPUs for general-purpose tasks has been significantly simplified by the release of programming interfaces. GPUs are ideal platforms for database workloads with abundant data-level parallelism (DLP). Conventionally, the GPU (device) is used as a discrete co-processor connected to the CPU (host) via the PCI-e bus. Existing studies have demonstrated that GPU query co-processing is an effective means of improving the performance of main-memory OLAP (Online Analytical Processing) databases. However, the relatively low bandwidth and high latency of the PCI-e bus are usually the major bottleneck for query co-processing performance. Recently, multiple vendors have implemented a novel coupled CPU-GPU architecture, which opens up new opportunities for optimizing query co-processing. In this thesis, we investigate these opportunities on coupled CPU-GPU architectures and propose hash join implementations and a complete query processing engine that take full advantage of the new hardware characteristics. Specifically, we start by studying fine-grained co-processing mechanisms for hash joins, one of the most important operators in database systems, with and without partitioning.
This fine-grained co-processing exposes an interesting design space. We extend existing cost models to automatically guide decisions within that design space. Our experimental results show that the fine-grained hash joins can outperform CPU-only, GPU-only and conventional CPU-GPU co-processing by 53%, 35% and 28%, respectively. However, such fine-grained operator designs still suffer from serious memory stalls because the main memory bandwidth of coupled CPU-GPU architectures is much lower than that of a discrete GPU. To overcome this obstacle and apply coupled CPU-GPU architectures to a wider range of areas, we propose a novel in-cache query co-processing paradigm that exploits the shared cache. Specifically, we adapt CPU-assisted prefetching to minimize cache misses in GPU query co-processing, and CPU-assisted decompression to improve query execution performance. Furthermore, we develop a cost-model-guided adaptation mechanism for distributing the workload of prefetching, decompression and query execution between the CPU and the GPU. The experimental results show that our in-cache query co-processing with this workload distribution adaptation mechanism can improve query execution performance over state-of-the-art GPU co-processing by up to 36% and 40% on two AMD APUs, respectively. Although the fine-grained hash joins and the in-cache query co-processing engine explore various designs to make the best use of the strengths of coupled architectures, they still fail to expose the concurrency inherent in each database query. Both use a kernel-based execution approach, which executes GPU kernels one at a time and optimizes individual kernels for resource utilization and performance. We therefore further propose a novel GPU-based pipelined query execution engine named GPL to achieve more concurrency and higher device utilization.
Unlike existing kernel-based execution, GPL takes advantage of hardware features of new-generation GPUs, including concurrent kernel execution and efficient data communication channels between kernels. We use a tiling technique to logically partition the input data into smaller data tiles so that the pipelined query plan can be adapted in a cost-based manner. We have conducted extensive experiments on AMD and NVIDIA GPUs. The results show that GPL significantly outperforms state-of-the-art kernel-based query processing approaches, with improvements of up to 50%.
DRNTU::Engineering::Computer science and engineering::Information systems::Database management