Master's Thesis: Fault management in Inter-Cloud system

Discussion in 'MASTER'S - DOCTORATE' started by Phí Lan Dương, 17/11/15.

    Table of Contents
    1 Introduction
    2 Literature Review
    2.1 Fault
    2.2 Event Correlation
    2.2.1 Event Correlation Techniques
    2.2.2 Existing Open Source Event Correlation Software
    2.3 Related Cloud and Fault Management Software
    2.4 Machine Learning
    2.4.1 Feature Extraction from Logs
    2.5 OpenStack
    2.6 Hadoop
    3 Proposal
    4 Experiment
    4.1 Use existing open source tools for monitoring and correlating logs
    4.1.1 Setup OpenStack
    4.1.2 Setup Ganglia
    4.1.3 Setup Hadoop and OpenStack on Windows Azure
    4.1.4 Open Stack Log Collection and Processing
    4.1.5 Open Stack Database Tables
    5 Conclusion and Future Work
    A Setup and Configuration
    A.1 Setup OpenStack with Fuel
    A.2 Setup and Configure Ganglia
    A.3 Logstash Configuration
    List of Figures
    2-1 A taxonomy of faults [1]
    2-2 A taxonomy for online failure prediction approaches [2]
    2-3 Fault management on inter-cloud environment
    2-4 OpenStack conceptual architecture [3]
    2-5 OpenStack logical architecture [3]
    2-6 Devstack’s localrc for controller node (192.168.1.5)
    2-7 Devstack’s localrc for compute node (192.168.1.6)
    2-8 nova-manage service list
    2-9 Launch an instance from Horizon dashboard
    2-10 MapReduce workflow
    2-11 Hadoop ecosystem
    2-12 HCatalog - table list
    2-13 HCatalog - batting_data table
    2-14 HCatalog - master_data table
    2-15 Hive query
    2-16 Hive query result
    2-17 Hive query log
    2-18 Pig query
    2-19 Pig query result
    2-20 Pig query log
    3-1 Fault analyzer in the fault resolution system
    3-2 Log management model
    3-3 OpenStack Log Analysis Block Diagram [4]
    3-4 Monitoring and Alerting for OpenStack [5]
    4-1 Critical issue from cinder-scheduler service
    4-2 Error from nova-compute service
    4-3 OpenStack Nova log files on controller node
    4-4 Error from savanna-api log
    4-5 Node Overview
    4-6 Summary Node Metric Last Hour
    4-7 CPU Metrics
    4-8 Disk Metrics
    4-9 Load Metrics
    4-10 Memory Metrics
    4-11 Network and Process Metrics
    4-12 Ganglia metrics on Graphite
    4-13 Windows Azure Virtual Network for Hadoop and OpenStack clusters
    4-14 Hadoop cluster on Windows Azure
    4-15 OpenStack Juno on Windows Azure
    4-16 Logstash Histogram
    4-17 Open Stack Log Type Summary
    4-18 Query and filter Open Stack Logs
    4-19 Open Stack Nova log
    4-20 Error status of an instance on OpenStack Dashboard
    4-21 Information from nova.instance_faults and nova.instances tables
    4-22 Exception details from nova.instance_faults table
    5-1 Log Analysis Workflow
    A-1 Fuel Server
    A-2 Fuel UI
    A-3 Successful Havana Deployment on Fuel
    A-4 Open Stack Havana Services
    List of Tables
    2.1 Advantages and drawbacks of the presented event correlation approaches
    2.2 OpenStack services
    2.3 OpenStack Log Location
    4.1 OpenStack Cinder Log Files
    4.2 OpenStack Nova Log Files
    4.3 OpenStack Horizon Log Files
    4.4 OpenStack Keystone Log Files
    4.5 OpenStack Glance Log Files
    4.6 OpenStack Ceilometer Log Files
    4.7 OpenStack Heat Log Files
    4.8 OpenStack Savanna Log Files
    Abstract
    Managing applications in an inter-cloud environment, and monitoring faults in particular,
    has become challenging due to the increasing complexity and diversity of these systems.
    An inter-cloud environment that centralizes various services needs a large number of
    system administrators and supporting systems to manage the faults occurring in its
    systems and services. It is therefore necessary to develop a supporting system that can
    manage and analyse faults.
    This master's thesis deals with fault management in inter-cloud systems. It reviews
    multiple studies of faults, fault management techniques, and related fault management
    software. We set up an inter-cloud environment and propose several approaches for
    monitoring and analysing faults in inter-cloud systems. In particular, we study the
    components of OpenStack and Hadoop and their ecosystems to understand the complexity
    of the inter-cloud environment. We deploy and integrate several open source tools for
    monitoring and analysing faults in the inter-cloud environment.
    Keywords: Fault Management, Inter-Cloud, Cloud Computing, OpenStack, Hadoop,
    Event Correlation
    Chapter 1
    Introduction
    Communication networks and distributed systems are becoming larger and more complex
    to meet the increasing demands of users. Managing the services operating on these
    systems is even more challenging. Cloud computing has recently emerged as a new
    paradigm for provisioning infrastructure, platform, and software as services over the
    Internet. This paradigm combines distributed computing resources and virtualization
    technologies to outsource not only platform and software but also infrastructure in
    order to meet the demands of users. Cloud computing is attractive to business owners and
    scientists because it allows them to deploy many types of workloads on demand easily.
    As we live in the data age, cloud computing is also an enabler for big data processing.
    In the last few years, there has been a significant increase in the number of commercial
    and open source cloud platforms, for example, Amazon EC2 [6], Google App Engine [7],
    Microsoft Windows Azure [8], OpenStack [9], Eucalyptus [10], Nimbus [11], and
    OpenNebula [12].
    In a similar context, Hadoop [13] - an Apache open source project and a widely adopted
    MapReduce implementation - has evolved rapidly into a major technology movement. It
    has emerged as a leading way to handle massive amounts of data, including not only
    structured data but also complex, unstructured data. The Apache Hadoop open source
    software supports reliable, scalable, distributed computing for large datasets on
    clusters of computers using simple programming models. It can scale from one to
    thousands of computers, and it detects and handles failures at the application layer.
    Hadoop, together with the MapReduce programming model, has been applied to multiple
    application domains related to large-scale data processing, such as indexing a large
    number of web pages, financial risk analysis, and the study of customer behavior.
    With such a variety of cloud and big data providers, consumers may have many workloads
    running across their inter-cloud environment. Managing applications in an inter-cloud
    environment, and monitoring faults in particular, becomes challenging due to the
    increasing complexity and diversity of these systems. As a result, an inter-cloud
    environment that centralizes various services needs a large number of system
    administrators and supporting systems to manage the faults occurring in its systems and
    services. It is therefore necessary to develop a supporting system that can manage and
    analyse faults.
    In this thesis, we propose an approach for monitoring and analysing faults in the
    inter-cloud environment. The approach employs open source technologies to facilitate
    monitoring and correlating service logs among cloud systems. The contribution is thus
    twofold:
    1. Studying faults and the existing techniques and tools for fault management on cloud
    systems. We also study the components of OpenStack and Hadoop and their ecosystems to
    understand the complexity of the inter-cloud environment.
    2. Deploying and integrating several open source tools for monitoring and analysing
    faults. In particular, we collect and process service logs in an inter-cloud environment
    that includes OpenStack and Hadoop components.
    The rest of the thesis is structured as follows. Chapter 2 presents a literature review
    of faults and a survey of tools and techniques for fault management in single-cloud and
    inter-cloud environments. Chapter 3 presents the proposal of the thesis: approaches for
    monitoring and analysing faults in the inter-cloud environment, together with the system
    architecture and component communication. Chapter 4 describes experiments on
    monitoring and analysing faults in inter-cloud systems. Chapter 5 concludes the thesis
    with a short discussion of ongoing work. Finally, Appendix A provides the details of the
    setup and configuration used in the thesis.
    Chapter 2
    Literature Review
    2.1 Fault
    Faults in cloud computing and Hadoop have attracted considerable research activity.
    Avizienis et al. [1] present the basic concepts and definitions associated with system
    dependability. According to that study, a failure (or service failure) is a deviation of
    the delivered service from the correct service. An error is the part of the total state
    of the system that may lead to a service failure. The cause of an error is a fault. All
    faults that may affect a system during its life are classified according to eight basic
    viewpoints, as shown in Figure 2-1. Because each viewpoint offers two alternatives,
    there would be 2^8 = 256 different combined fault classes if all combinations were
    possible.
    Figure 2-1: A taxonomy of faults [1]
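    The count of 256 combined fault classes can be checked with a short sketch. The source
    does not list the viewpoints, so the viewpoint and class names below are recalled from
    the cited paper [1] and should be checked against Figure 2-1. The dictionary layout and
    the function name are illustrative assumptions, not part of the thesis.

```python
from itertools import product

# Eight viewpoints, each with two elementary fault classes.
# Names are recalled from Avizienis et al. [1]; verify against Figure 2-1.
# The dictionary layout itself is an illustrative assumption.
VIEWPOINTS = {
    "phase of creation": ("development", "operational"),
    "system boundaries": ("internal", "external"),
    "phenomenological cause": ("natural", "human-made"),
    "dimension": ("hardware", "software"),
    "objective": ("malicious", "non-malicious"),
    "intent": ("deliberate", "non-deliberate"),
    "capability": ("accidental", "incompetence"),
    "persistence": ("permanent", "transient"),
}


def combined_fault_classes(viewpoints=VIEWPOINTS):
    """Return every combination that picks one class per viewpoint."""
    names = list(viewpoints)
    return [dict(zip(names, choice)) for choice in product(*viewpoints.values())]


classes = combined_fault_classes()
print(len(classes))  # 2**8 = 256
```

    This only enumerates the theoretical upper bound of combinations; it does not say which
    of them are meaningful in practice.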
    According to this study [1], the major techniques for handling faults can also be grouped
    into:
    ∙ fault prevention: means to prevent the occurrence or introduction of faults;
    ∙ fault tolerance: means to avoid service failures in the presence of faults;
    ∙ fault removal: means to reduce the number and severity of faults;
    ∙ fault forecasting: means to estimate the present number, the future incidence, and
    the likely consequences of faults.
    A survey by Salfner et al. [2] presents a variety of online failure prediction methods.
    It develops a taxonomy for failure prediction based on runtime monitoring and on a
    variety of models and methods that use the current state of a system and past
    experience. As shown in Figure 2-2, the full taxonomy is split vertically into four
    major branches according to the type of input data used, namely data from failure
    tracking, symptom monitoring, detected error reporting, and undetected error auditing.
    Each major branch is further divided vertically into principal approaches. Each
    principal approach is then divided horizontally into categories that group the surveyed
    methods.
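    As a rough illustration of the top level of this taxonomy, the four branches can be
    written as an enumeration keyed by the type of input data. The branch names come from
    the text above. The data-source labels and their mapping to branches are hypothetical
    examples chosen for this sketch and do not come from the survey or the thesis.

```python
from enum import Enum


class InputDataBranch(Enum):
    """Four top-level branches of the taxonomy, by type of input data."""
    FAILURE_TRACKING = "failure tracking"
    SYMPTOM_MONITORING = "symptom monitoring"
    DETECTED_ERROR_REPORTING = "detected error reporting"
    UNDETECTED_ERROR_AUDITING = "undetected error auditing"


# Hypothetical data-source labels; this mapping is an illustrative assumption.
SOURCE_TO_BRANCH = {
    "past_failure_timestamps": InputDataBranch.FAILURE_TRACKING,
    "cpu_and_memory_metrics": InputDataBranch.SYMPTOM_MONITORING,
    "service_error_logs": InputDataBranch.DETECTED_ERROR_REPORTING,
    "data_integrity_checks": InputDataBranch.UNDETECTED_ERROR_AUDITING,
}


def branch_for(source: str) -> InputDataBranch:
    """Look up the taxonomy branch for a known data-source label."""
    try:
        return SOURCE_TO_BRANCH[source]
    except KeyError:
        raise ValueError(f"unknown data source: {source!r}") from None


print(branch_for("service_error_logs").value)
```

    Under this reading, service logs of the kind listed in the thesis outline would fall
    under detected error reporting, while resource metrics would fall under symptom
    monitoring.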
     