Performance Evaluation of Distributed SQL Query Engines and Query Time Predictors

    Abstract
With the decrease in cost of storage and computation in public clouds, even small and medium enterprises (SMEs) are able to process large amounts of data. This causes businesses to increase the amounts of data they collect, to sizes that are difficult for traditional database management systems to handle. Distributed SQL Query Engines (DSQEs), which can easily handle data of this size, are therefore increasingly used in a variety of domains. Especially users in small companies with little expertise may face the challenge of selecting an appropriate engine for their specific applications. A second problem lies with the variable performance of DSQEs. While all of the state-of-the-art DSQEs claim to have very fast response times, none of them offers performance guarantees. This is a serious problem, because companies that use these systems as part of their business do need to provide such guarantees to their customers, as stated in their Service Level Agreements (SLAs).
Although both industry and academia are attempting to come up with high-level benchmarks, the performance of DSQEs has never been explored or compared in depth. We propose an empirical method for evaluating the performance of DSQEs with representative metrics, datasets, and system configurations. We implement a micro-benchmarking suite of three classes of SQL queries for both a synthetic and a real-world dataset, and we report response time, resource utilization, and scalability. We use our micro-benchmarking suite to analyze and compare three state-of-the-art engines, viz. Shark, Impala, and Hive. We gain valuable insights for each engine and we present a comprehensive comparison of these DSQEs. We find that different query engines have widely varying performance: Hive is consistently outperformed by the other engines, but whether Impala or Shark is the best performer depends highly on the query type.
In addition to the performance evaluation of DSQEs, we evaluate three query time predictors, two of which use machine learning, viz. multiple linear regression and support vector regression. These query time predictors can be used as input for scheduling policies in DSQEs. The scheduling policies can then change the query execution order based on the predictions (e.g., give precedence to queries that take less time to complete). We find that both machine learning based predictors have acceptable performance, while a naive baseline predictor is on average more than two times less accurate.

Preface
Ever since I started studying Computer Science I have been fascinated by the ways tasks can be distributed over multiple computers and executed in parallel. Cloud Computing and Big Data Analytics appealed to me for this very reason. This made me decide to conduct my thesis project at Azavista, a small start-up company based in Amsterdam specialised in providing itinerary planning tools for the meeting and event industry. At Azavista there is a particular interest in providing customers with answers to analytical questions in near real-time. This thesis is the result of the efforts to realise this goal.
During the past year I have learned a lot in the field of Cloud Computing, Big Data Analytics, and (Computer) Science in general. I would like to thank my supervisors Prof.dr.ir. D.H.J. Epema and Dr.ir. A. Iosup for their guidance and encouragement throughout the project. Being a perfectionist, I found it very helpful to know when I was on the right track. I also want to thank my colleague and mentor Dr. José M. Viña Rebolledo for his many insights and feedback during the thesis project. I am very grateful that both he and my friend Jan Zahálka helped me understand machine learning, which was of great importance for the second part of my thesis.
I want to thank my company supervisors Robert de Geus and JP van der Kuijl for giving me the freedom to experiment and for providing the financial support for running experiments on Amazon EC2. Furthermore, I also want to thank my other colleagues at Azavista for the great time and company, and especially Mervin Graves for his technical support.
I want to thank Sietse Au, Marcin Biczak, Mihai Capotă, Bogdan Ghiț, Yong Guo, and the other members of the Parallel and Distributed Systems Group for sharing ideas. Last but not least, I also want to thank my family and friends for providing great moral support, especially during the times progress was slow.
    Stefan van Wouw
    Delft, The Netherlands
    10th October 2014
Contents
Preface
1 Introduction
  1.1 Problem Statement
  1.2 Approach
  1.3 Thesis Outline and Contributions
2 Background and Related Work
  2.1 Cloud Computing
  2.2 State-of-the-Art Distributed SQL Query Engines
  2.3 Related Distributed SQL Query Engine Performance Studies
  2.4 Machine Learning Algorithms
  2.5 Principal Component Analysis
3 Performance Evaluation of Distributed SQL Query Engines
  3.1 Query Engine Selection
  3.2 Experimental Method
    3.2.1 Workload
    3.2.2 Performance Aspects and Metrics
    3.2.3 Evaluation Procedure
  3.3 Experimental Setup
  3.4 Experimental Results
    3.4.1 Processing Power
    3.4.2 Resource Consumption
    3.4.3 Resource Utilization over Time
    3.4.4 Scalability
  3.5 Summary
4 Performance Evaluation of Query Time Predictors
  4.1 Predictor Selection
  4.2 Perkin: Scheduler Design
    4.2.1 Use Case Scenario
    4.2.2 Architecture
    4.2.3 Scheduling Policies
  4.3 Experimental Method
    4.3.1 Output Traces
    4.3.2 Performance Metrics
    4.3.3 Evaluation Procedure
  4.4 Experimental Results
  4.5 Summary
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
A Detailed Distributed SQL Query Engine Performance Metrics
B Detailed Distributed SQL Query Engine Resource Utilization
C Cost-based Analytical Modeling Approach to Prediction
D Evaluation Procedure Examples
Chapter 1
    Introduction
With the decrease in cost of storage and computation in public clouds, even small and medium enterprises (SMEs) are able to process large amounts of data. This causes businesses to increase the amounts of data they collect, to sizes that are difficult for traditional database management systems to handle. Exactly this challenge was also encountered at Azavista, the company at which this thesis work was conducted. In order to assist customers in planning itineraries using its software for event and group travel planning, Azavista processes multi-terabyte datasets every day. The traditional database management systems previously used by this SME simply did not scale along with the size of the data to be processed.
    The general phenomenon of exponential data growth has led to Hadoop-oriented
    Big Data Processing Platforms that can handle multiple terabytes to even petabytes
    with ease. Among these platforms are stream processing systems such as S4 [44],
    Storm [22], and Spark Streaming [64]; general purpose batch processing systems
like Hadoop MapReduce [6] and HaLoop [25]; and distributed SQL query engines
    (DSQEs) such as Hive [53], Impala [15], Shark [59], and more recently, Presto
    [19], Drill [35], and Hive-on-Tez [7].
Batch processing platforms are able to process enormous amounts of data (terabytes and up) but have relatively long run times (hours, days, or more). Stream processing systems, on the other hand, produce immediate results when processing a data stream, but can only perform a subset of algorithms, because not all data is available at any point in time. Distributed SQL Query Engines are generally built on top of (a combination of) stream and batch processing systems, but they appear to the user as if they were traditional relational databases. This allows the user to query structured data using an SQL dialect, while at the same time having much higher scalability than traditional databases. Besides these different systems, hybrids also exist in the form of so-called lambda architectures [17], where data is processed both by a batch processing system and by a stream processor. This allows the stream processor to deliver fast but approximate results, while in the background the batch processing system slowly computes the exact results.
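To make this concrete, the following minimal sketch (in Python, with all names hypothetical) shows a lambda-style serving layer that answers a count query from the exact batch view where available and falls back to the fresh but approximate streaming view for the most recent data:

    # Minimal lambda-architecture serving layer (illustrative only; all names
    # are hypothetical). The batch view is exact but lags behind; the stream
    # view is fresh but approximate.
    class ServingLayer:
        def __init__(self):
            self.batch_view = {}    # hour -> exact event count (batch layer)
            self.stream_view = {}   # hour -> approximate count (stream layer)
            self.batch_horizon = 0  # last hour fully processed by the batch layer

        def count_events(self, first_hour, last_hour):
            """Count events in [first_hour, last_hour], merging both views."""
            total = 0
            for hour in range(first_hour, last_hour + 1):
                if hour <= self.batch_horizon:
                    total += self.batch_view.get(hour, 0)   # exact result
                else:
                    total += self.stream_view.get(hour, 0)  # fast approximation
            return total

As the batch layer advances its horizon, answers for older hours automatically switch from the approximate view to the exact one.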
In this work we focus on DSQEs and their internals, since although their authors claim them to be fast and scalable, none of them provides guarantees for queries with deadlines. In addition, no in-depth comparisons between these systems are available.
    1.1 Problem Statement
Selecting the most suitable of all available DSQEs for a particular SME is a big challenge, because SMEs are not likely to have the expertise and the resources available to perform an in-depth study. Although performance studies of Distributed SQL Query Engines do exist [4, 16, 33, 34, 47, 59], many of them use only synthetic workloads or very high-level comparisons based solely on query response time.
A second problem lies with the variable performance of DSQEs. While all of the state-of-the-art DSQEs claim to have very fast response times (seconds instead of minutes), none of them offers performance guarantees. This is a serious problem, because companies that use these systems as part of their business do need to provide such guarantees to their customers, as stated in their Service Level Agreements (SLAs). There are many scenarios in which multiple tenants¹ are using the same data cluster and resources need to be shared (e.g., Google's BigQuery). In this case, queries might take much longer to complete than in a single-tenant environment, possibly violating SLAs signed with the end-customer. Related work provides a solution to this problem in the form of BlinkDB [24]. This DSQE component for Shark does provide bounded query response times, but at the cost of less accurate results. A downside of this component, however, is that it is heavily query engine dependent, as it uses a cost-based analytical heuristic to predict the execution time of the different parts of a query.

¹ A tenant is an actor of a distributed system that represents a group of end-users, for example, a third-party system that periodically issues queries to analyze some data in order to display the results to its users on a website.
    In this thesis we try to address the lack of in-depth performance evaluation of
    the current state-of-the-art DSQEs by answering the following research question:
    RQ1 What is the performance of state-of-the-art Distributed SQL Query Engines
    in a single-tenant environment?
After answering this question, we evaluate an alternative to BlinkDB's way of predicting query time by answering the following research question:
RQ2 What is the performance of query time predictors that utilize machine learning techniques?
Query time predictors are able to predict both the query execution time and the query response time. The query execution time is the time a query is actively being processed, whereas the query response time also takes the queue wait time into account. The predicted execution time can be used to change the query execution order in the system so as to minimize response time. We are particularly interested in applying machine learning techniques, because they have been shown to yield promising results in this field [60]. In addition, machine learning algorithms do not require in-depth knowledge of the inner mechanics of a DSQE. Thus, machine learning based query time predictors can easily be applied to any query engine, whereas BlinkDB is tied to many internals of Shark.
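As an illustration of such a policy, the sketch below (a minimal example, not the Perkin scheduler of Chapter 4; all names are hypothetical) reorders a queue by predicted execution time, so that short queries are not stuck behind long ones:

    # Shortest-predicted-job-first ordering (illustrative only). Since
    # response time = queue wait time + execution time, running queries
    # with small predicted execution times first reduces the average wait.
    def schedule(queue, predict_execution_time):
        """Order queued queries by predicted execution time, ascending."""
        return sorted(queue, key=predict_execution_time)

    predictions = {"q1": 120.0, "q2": 5.0, "q3": 30.0}  # predicted seconds
    order = schedule(["q1", "q2", "q3"], predictions.get)
    # order == ["q2", "q3", "q1"]: the 5 s query no longer waits
    # behind the 120 s one.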
    1.2 Approach
To answer the first research question (RQ1), we define a comprehensive performance evaluation method to assess different aspects of query engines. We compare Hive, a somewhat older but still widely used query engine, with Impala and Shark, both state-of-the-art distributed query engines. This method can be used to compare current and future query engines, despite not covering all the methodological and practical aspects of a true benchmark. The method focuses on three performance aspects: processing power, resource utilization, and scalability. With the results from this study, system developers and data analysts can make informed choices related to both cluster infrastructure and query tuning.
In order to answer the second research question (RQ2), we evaluate three query time predictor methods, namely Multiple Linear Regression (MLR), Support Vector Regression (SVR), and a baseline method, Last2. We do this by designing a workload and training the predictors on the output traces of the three different ways in which we executed this workload. Predictor accuracy is reported using three complementary metrics. The results from this study allow engineers to select a predictor for use in DSQEs that is both fast to train and accurate.
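As a sketch of how such predictors can be trained on output traces (assuming scikit-learn; the two features below are illustrative rather than the feature set of Chapter 4, and Last2 is assumed here to average the two most recent observed times):

    # Training MLR and SVR query time predictors on a toy trace (assumes
    # scikit-learn; features and numbers are illustrative only).
    import numpy as np
    from sklearn.linear_model import LinearRegression  # multiple linear regression
    from sklearn.svm import SVR                        # support vector regression

    # Hypothetical trace: per query, input size in GB and number of joins,
    # plus the measured execution time in seconds.
    X = np.array([[1.0, 0], [2.0, 1], [4.0, 1], [8.0, 2], [16.0, 2]])
    y = np.array([3.1, 7.8, 14.2, 31.0, 60.5])

    mlr = LinearRegression().fit(X, y)
    svr = SVR(kernel="rbf", C=100.0).fit(X, y)

    def last2(history):
        """Baseline: assumed to average the two most recent observed times."""
        return sum(history[-2:]) / len(history[-2:])

    print(mlr.predict([[12.0, 2]]), svr.predict([[12.0, 2]]), last2(list(y)))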
    1.3 Thesis Outline and Contributions
The thesis is structured as follows. In Chapter 2 we provide background information and related work, including a definition of the Cloud Computing paradigm, an overview of state-of-the-art Distributed SQL Query Engines, and background information regarding machine learning. In Chapter 3 we evaluate the performance of the state-of-the-art Distributed SQL Query Engines on both synthetic and real-world data. In Chapter 4 we evaluate query time predictors that use machine learning techniques. In Chapter 5 we present the conclusions of our work and describe directions for future work.
    Our main contributions are the following:
• We propose a method for the performance evaluation of DSQEs (Chapter 3), which includes defining a workload representative of SMEs, as well as defining the performance aspects of the query engines: processing power, resource utilization, and scalability.