Performance Evaluation of Distributed SQL Query Engines and Query Time Predictors

    Abstract
With the decrease in cost of storage and computation in public clouds, even small and medium enterprises (SMEs) are able to process large amounts of data. This causes businesses to increase the amounts of data they collect, to sizes that are difficult for traditional database management systems to handle. Distributed SQL Query Engines (DSQEs), which can easily handle data of this size, are therefore increasingly used in a variety of domains. Especially users in small companies with little expertise may face the challenge of selecting an appropriate engine for their specific applications. A second problem lies with the variable performance of DSQEs. While all of the state-of-the-art DSQEs claim to have very fast response times, none of them offers performance guarantees. This is a serious problem, because companies that use these systems as part of their business do need to provide such guarantees to their customers, as stated in their Service Level Agreements (SLAs).
Although both industry and academia are attempting to come up with high-level benchmarks, the performance of DSQEs has never been explored or compared in depth. We propose an empirical method for evaluating the performance of DSQEs with representative metrics, datasets, and system configurations. We implement a micro-benchmarking suite of three classes of SQL queries for both a synthetic and a real-world dataset, and we report response time, resource utilization, and scalability. We use our micro-benchmarking suite to analyze and compare three state-of-the-art engines, viz. Shark, Impala, and Hive. We gain valuable insights for each engine and we present a comprehensive comparison of these DSQEs. We find that different query engines have widely varying performance: Hive is consistently outperformed by the other engines, but whether Impala or Shark is the best performer depends highly on the query type.
In addition to the performance evaluation of DSQEs, we evaluate three query time predictors, two of which use machine learning, viz. multiple linear regression and support vector regression. These query time predictors can be used as input for scheduling policies in DSQEs. The scheduling policies can then change the query execution order based on the predictions (e.g., give precedence to queries that take less time to complete). We find that both machine learning based predictors have acceptable performance, while a naive baseline predictor is on average more than two times less accurate.

Preface
Ever since I started studying Computer Science I have been fascinated by the ways tasks can be distributed over multiple computers and executed in parallel. Cloud Computing and Big Data Analytics appealed to me for this very reason. This made me decide to conduct my thesis project at Azavista, a small start-up company based in Amsterdam specialised in providing itinerary planning tools for the meeting and event industry. At Azavista there is a particular interest in providing customers with answers to analytical questions in near real-time. This thesis is the result of the efforts to realise this goal.
During the past year I have learned a lot in the field of Cloud Computing, Big Data Analytics, and (Computer) Science in general. I would like to thank my supervisors Prof.dr.ir. D.H.J. Epema and Dr.ir. A. Iosup for their guidance and encouragement throughout the project. Being a perfectionist, I found it very helpful to know when I was on the right track. I also want to thank my colleague and mentor Dr. José M. Viña Rebolledo for his many insights and feedback during the thesis project. I am very grateful that both he and my friend Jan Zahálka helped me understand machine learning, which was of great importance for the second part of my thesis.
I want to thank my company supervisors Robert de Geus and JP van der Kuijl for giving me the freedom to experiment and for providing the financial support for running experiments on Amazon EC2. Furthermore, I also want to thank my other colleagues at Azavista for the great time and company, and especially Mervin Graves for his technical support.
I want to thank Sietse Au, Marcin Biczak, Mihai Capotă, Bogdan Ghiț, Yong Guo, and the other members of the Parallel and Distributed Systems Group for sharing ideas. Last but not least, I also want to thank my family and friends for providing great moral support, especially during the times progress was slow.
    Stefan van Wouw
    Delft, The Netherlands
    10th October 2014
Contents
Preface
1 Introduction
  1.1 Problem Statement
  1.2 Approach
  1.3 Thesis Outline and Contributions
2 Background and Related Work
  2.1 Cloud Computing
  2.2 State-of-the-Art Distributed SQL Query Engines
  2.3 Related Distributed SQL Query Engine Performance Studies
  2.4 Machine Learning Algorithms
  2.5 Principal Component Analysis
3 Performance Evaluation of Distributed SQL Query Engines
  3.1 Query Engine Selection
  3.2 Experimental Method
    3.2.1 Workload
    3.2.2 Performance Aspects and Metrics
    3.2.3 Evaluation Procedure
  3.3 Experimental Setup
  3.4 Experimental Results
    3.4.1 Processing Power
    3.4.2 Resource Consumption
    3.4.3 Resource Utilization over Time
    3.4.4 Scalability
  3.5 Summary
4 Performance Evaluation of Query Time Predictors
  4.1 Predictor Selection
  4.2 Perkin: Scheduler Design
    4.2.1 Use Case Scenario
    4.2.2 Architecture
    4.2.3 Scheduling Policies
  4.3 Experimental Method
    4.3.1 Output Traces
    4.3.2 Performance Metrics
    4.3.3 Evaluation Procedure
  4.4 Experimental Results
  4.5 Summary
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
A Detailed Distributed SQL Query Engine Performance Metrics
B Detailed Distributed SQL Query Engine Resource Utilization
C Cost-based Analytical Modeling Approach to Prediction
D Evaluation Procedure Examples
Chapter 1
    Introduction
With the decrease in cost of storage and computation in public clouds, even small and medium enterprises (SMEs) are able to process large amounts of data. This causes businesses to increase the amounts of data they collect, to sizes that are difficult for traditional database management systems to handle. Exactly this challenge was also encountered at Azavista, the company at which this thesis work was conducted. In order to assist customers in planning itineraries using its software for event and group travel planning, Azavista processes multi-terabyte datasets every day. The traditional database management systems previously used by this SME simply did not scale along with the size of the data to be processed.
    The general phenomenon of exponential data growth has led to Hadoop-oriented
    Big Data Processing Platforms that can handle multiple terabytes to even petabytes
    with ease. Among these platforms are stream processing systems such as S4 [44],
    Storm [22], and Spark Streaming [64]; general purpose batch processing systems
like Hadoop MapReduce [6] and HaLoop [25]; and distributed SQL query engines
    (DSQEs) such as Hive [53], Impala [15], Shark [59], and more recently, Presto
    [19], Drill [35], and Hive-on-Tez [7].
Batch processing platforms are able to process enormous amounts of data (terabytes and up) but have relatively long run times (hours, days, or more). Stream processing systems, on the other hand, produce immediate results when processing a data stream, but can only perform a subset of algorithms, because not all data is available at any point in time. Distributed SQL Query Engines are generally built on top of (a combination of) stream and batch processing systems, but they appear to the user as if they were traditional relational databases. This allows the user to query structured data using an SQL dialect, while at the same time having much higher scalability than traditional databases. Besides these different systems, hybrids also exist in the form of so-called lambda architectures [17], where data is processed both by a batch processing system and by a stream processor. This allows the stream processor to deliver fast but approximate results, while in the background the batch processing system slowly computes the exact results.
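To make this concrete, the following minimal sketch (in Python, with all names hypothetical) shows a lambda-style serving layer that answers a count query from the exact batch view where available and falls back to the fresh but approximate streaming view for the most recent data:

    # Minimal lambda-architecture serving layer (illustrative only; all names
    # are hypothetical). The batch view is exact but lags behind; the stream
    # view is fresh but approximate.
    class ServingLayer:
        def __init__(self):
            self.batch_view = {}    # hour -> exact event count (batch layer)
            self.stream_view = {}   # hour -> approximate count (stream layer)
            self.batch_horizon = 0  # last hour fully processed by the batch layer

        def count_events(self, first_hour, last_hour):
            """Count events in [first_hour, last_hour], merging both views."""
            total = 0
            for hour in range(first_hour, last_hour + 1):
                if hour <= self.batch_horizon:
                    total += self.batch_view.get(hour, 0)   # exact result
                else:
                    total += self.stream_view.get(hour, 0)  # fast approximation
            return total

As the batch layer advances its horizon, answers for older hours automatically switch from the approximate view to the exact one.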
In this work we focus on DSQEs and their internals, since although their authors claim them to be fast and scalable, none of them provides guarantees for queries with deadlines. In addition, no in-depth comparisons between these systems are available.
    1.1 Problem Statement
Selecting the most suitable of all available DSQEs for a particular SME is a big challenge, because SMEs are not likely to have the expertise and the resources available to perform an in-depth study. Although performance studies of Distributed SQL Query Engines do exist [4, 16, 33, 34, 47, 59], many of them use only synthetic workloads or very high-level comparisons based solely on query response time.
A second problem lies with the variable performance of DSQEs. While all of the state-of-the-art DSQEs claim to have very fast response times (seconds instead of minutes), none of them offers performance guarantees. This is a serious problem, because companies that use these systems as part of their business do need to provide such guarantees to their customers, as stated in their Service Level Agreements (SLAs). There are many scenarios in which multiple tenants¹ are using the same data cluster and resources need to be shared (e.g., Google's BigQuery). In this case, queries might take much longer to complete than in a single-tenant environment, possibly violating SLAs signed with the end-customer. Related work provides a solution to this problem in the form of BlinkDB [24]. This DSQE component for Shark does provide bounded query response times, but at the cost of less accurate results. A downside of this component, however, is that it is heavily query engine dependent, as it uses a cost-based analytical heuristic to predict the execution time of the different parts of a query.

¹ A tenant is an actor of a distributed system that represents a group of end-users, for example, a third-party system that periodically issues queries to analyze some data in order to display the results to its users on a website.
    In this thesis we try to address the lack of in-depth performance evaluation of
    the current state-of-the-art DSQEs by answering the following research question:
    RQ1 What is the performance of state-of-the-art Distributed SQL Query Engines
    in a single-tenant environment?
After answering this question, we evaluate an alternative to BlinkDB's way of predicting query time by answering the following research question:
RQ2 What is the performance of query time predictors that utilize machine learning techniques?
Query time predictors are able to predict both the query execution time and the query response time. The query execution time is the time a query is actively being processed, whereas the query response time also takes the queue wait time into account. The predicted execution time can be used to change the query execution order in the system so as to minimize response time. We are particularly interested in applying machine learning techniques, because they have been shown to yield promising results in this field [60]. In addition, machine learning algorithms do not require in-depth knowledge of the inner mechanics of a DSQE. Thus, machine learning based query time predictors can easily be applied to any query engine, whereas BlinkDB is tied to many internals of Shark.
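As an illustration of such a policy, the sketch below (a minimal example, not the Perkin scheduler of Chapter 4; all names are hypothetical) reorders a queue by predicted execution time, so that short queries are not stuck behind long ones:

    # Shortest-predicted-job-first ordering (illustrative only). Since
    # response time = queue wait time + execution time, running queries
    # with small predicted execution times first reduces the average wait.
    def schedule(queue, predict_execution_time):
        """Order queued queries by predicted execution time, ascending."""
        return sorted(queue, key=predict_execution_time)

    predictions = {"q1": 120.0, "q2": 5.0, "q3": 30.0}  # predicted seconds
    order = schedule(["q1", "q2", "q3"], predictions.get)
    # order == ["q2", "q3", "q1"]: the 5 s query no longer waits
    # behind the 120 s one.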
    1.2 Approach
To answer the first research question (RQ1), we define a comprehensive performance evaluation method to assess different aspects of query engines. We compare Hive, a somewhat older but still widely used query engine, with Impala and Shark, both state-of-the-art distributed query engines. This method can be used to compare current and future query engines, despite not covering all the methodological and practical aspects of a true benchmark. The method focuses on three performance aspects: processing power, resource utilization, and scalability. With the results from this study, system developers and data analysts can make informed choices related to both cluster infrastructure and query tuning.
In order to answer the second research question (RQ2), we evaluate three query time predictor methods, namely Multiple Linear Regression (MLR), Support Vector Regression (SVR), and a baseline method, Last2. We do this by designing a workload and training the predictors on the output traces of the three different ways in which we executed this workload. Predictor accuracy is reported using three complementary metrics. The results from this study allow engineers to select a predictor for use in DSQEs that is both fast to train and accurate.
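As a sketch of how such predictors can be trained on output traces (assuming scikit-learn; the two features below are illustrative rather than the feature set of Chapter 4, and Last2 is assumed here to average the two most recent observed times):

    # Training MLR and SVR query time predictors on a toy trace (assumes
    # scikit-learn; features and numbers are illustrative only).
    import numpy as np
    from sklearn.linear_model import LinearRegression  # multiple linear regression
    from sklearn.svm import SVR                        # support vector regression

    # Hypothetical trace: per query, input size in GB and number of joins,
    # plus the measured execution time in seconds.
    X = np.array([[1.0, 0], [2.0, 1], [4.0, 1], [8.0, 2], [16.0, 2]])
    y = np.array([3.1, 7.8, 14.2, 31.0, 60.5])

    mlr = LinearRegression().fit(X, y)
    svr = SVR(kernel="rbf", C=100.0).fit(X, y)

    def last2(history):
        """Baseline: assumed to average the two most recent observed times."""
        return sum(history[-2:]) / len(history[-2:])

    print(mlr.predict([[12.0, 2]]), svr.predict([[12.0, 2]]), last2(list(y)))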
    1.3 Thesis Outline and Contributions
The thesis is structured as follows. In Chapter 2 we provide background information and related work, including a definition of the Cloud Computing paradigm, an overview of state-of-the-art Distributed SQL Query Engines, and background information regarding machine learning. In Chapter 3 we evaluate the performance of the state-of-the-art Distributed SQL Query Engines on both synthetic and real-world data. In Chapter 4 we evaluate query time predictors that use machine learning techniques. In Chapter 5 we present the conclusions of our work and describe directions for future work.
    Our main contributions are the following:
• We propose a method for the performance evaluation of DSQEs (Chapter 3), which includes defining a workload representative of SMEs, as well as defining the performance aspects of the query engines: processing power, resource utilization, and scalability.