Master's Thesis: Fault management in Inter-Cloud system

Phí Lan Dương · 17/11/15

Table of Contents
1 Introduction
2 Literature Review
2.1 Fault
2.2 Event Correlation
2.2.1 Event Correlation Techniques
2.2.2 Existing Open Source Event Correlation Software
2.3 Related Cloud and Fault Management Software
2.4 Machine Learning
2.4.1 Feature Extraction from Logs
2.5 OpenStack
2.6 Hadoop
3 Proposal
4 Experiment
4.1 Use existing open source tools for monitoring and correlating logs
4.1.1 Setup OpenStack
4.1.2 Setup Ganglia
4.1.3 Setup Hadoop and OpenStack on Windows Azure
4.1.4 Open Stack Log Collection and Processing
4.1.5 Open Stack Database Tables
5 Conclusion and Future Work
A Setup and Configuration
A.1 Setup OpenStack with Fuel
A.2 Setup and Configure Ganglia
A.3 Logstash Configuration
List of Figures
2-1 A taxonomy of faults [1]
2-2 A taxonomy for online failure prediction approaches [2]
2-3 Fault management on inter-cloud environment
2-4 OpenStack conceptual architecture [3]
2-5 OpenStack logical architecture [3]
2-6 Devstack’s localrc for controller node (192.168.1.5)
2-7 Devstack’s localrc for compute node (192.168.1.6)
2-8 nova-manage service list
2-9 Launch an instance from Horizon dashboard
2-10 MapReduce workflow
2-11 Hadoop ecosystem
2-12 HCatalog - table list
2-13 HCatalog - batting_data table
2-14 HCatalog - master_data table
2-15 Hive query
2-16 Hive query result
2-17 Hive query log
2-18 Pig query
2-19 Pig query result
2-20 Pig query log
3-1 Fault analyzer in the fault resolution system
3-2 Log management model
3-3 OpenStack Log Analysis Block Diagram [4]
3-4 Monitoring and Alerting for OpenStack [5]
4-1 Critical issue from cinder-scheduler service
4-2 Error from nova-compute service
4-3 OpenStack Nova log files on controller node
4-4 Error from savanna-api log
4-5 Node Overview
4-6 Summary Node Metric Last Hour
4-7 CPU Metrics
4-8 Disk Metrics
4-9 Load Metrics
4-10 Memory Metrics
4-11 Network and Process Metrics
4-12 Ganglia metrics on Graphite
4-13 Windows Azure Virtual Network for Hadoop and OpenStack clusters
4-14 Hadoop cluster on Windows Azure
4-15 OpenStack Juno on Windows Azure
4-16 Logstash Histogram
4-17 Open Stack Log Type Summary
4-18 Query and filter Open Stack Logs
4-19 Open Stack Nova log
4-20 Error status of an instance on OpenStack Dashboard
4-21 Information from nova.instance_faults and nova.instances tables
4-22 Exception details from nova.instance_faults table
5-1 Log Analysis Workflow
A-1 Fuel Server
A-2 Fuel UI
A-3 Successful Havana Deployment on Fuel
A-4 Open Stack Havana Services
List of Tables
2.1 Advantages and drawbacks of the presented event correlation approaches
2.2 OpenStack services
2.3 OpenStack Log Location
4.1 OpenStack Cinder Log Files
4.2 OpenStack Nova Log Files
4.3 OpenStack Horizon Log Files
4.4 OpenStack Keystone Log Files
4.5 OpenStack Glance Log Files
4.6 OpenStack Ceilometer Log Files
4.7 OpenStack Heat Log Files
4.8 OpenStack Savanna Log Files
Abstract
Managing applications in an inter-cloud environment, and monitoring faults in particular,
is becoming challenging because these systems are growing in complexity and diversity.
Because an inter-cloud environment centralizes many different services, it needs a large
number of system administrators and supporting systems to manage the faults that occur in
its systems and services. A supporting system that can manage and analyse faults is
therefore needed.
This master's thesis addresses fault management in inter-cloud systems. It surveys
studies of faults, fault management techniques, and related fault management software.
We set up an inter-cloud environment and propose several approaches for monitoring and
analysing faults in it. In particular, we study OpenStack and Hadoop, their components,
and their ecosystems to understand the complexity of the inter-cloud environment. We
deploy and integrate several open source tools for monitoring and analysing faults in
this environment.
Keywords: Fault Management, Inter-Cloud, Cloud Computing, OpenStack, Hadoop,
Event Correlation
Chapter 1
Introduction
Communication networks and distributed systems are becoming larger and more complex
to meet the increasing demands of users. Managing the services that run on these systems
is even more challenging. Cloud computing has recently emerged as a new paradigm for
provisioning infrastructure, platform, and software as services over the Internet. This
paradigm combines distributed computing resources with virtualization technologies, so
that not only platform and software but also infrastructure can be outsourced to meet
user demand. Cloud computing is attractive to business owners and scientists because it
allows them to deploy many types of workloads easily and on demand. As we live in the
data age, cloud computing is also an enabler for big data processing. In the last few
years, the number of commercial and open source cloud platforms has increased
significantly; examples include Amazon EC2 [6], Google App Engine [7], Microsoft Windows
Azure [8], OpenStack [9], Eucalyptus [10], Nimbus [11], and OpenNebula [12].
In a similar context, Hadoop [13], an Apache open source project and a widely adopted
MapReduce implementation, has evolved rapidly into a major technology movement. It
has emerged as a popular way to handle massive amounts of data, including not only
structured data but also complex, unstructured data. The Apache Hadoop open-source
software supports reliable, scalable, distributed computing for large datasets on clusters
of computers using simple programming models. It can scale from a single computer to
thousands of computers, and it detects and handles failures at the application layer.
Hadoop, together with the MapReduce programming model, has been applied in many
application domains that involve large-scale data processing, such as indexing large
numbers of web pages, financial risk analysis, and the study of customer behavior.
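As a minimal sketch of the MapReduce model mentioned above, the following pure-Python
example counts log lines per severity level in three steps: map, shuffle, and reduce. It
does not use Hadoop itself. The sample log lines, their field layout, and the function
names (map_phase, shuffle, reduce_phase, run_job) are assumptions made for illustration
only.

```python
from collections import defaultdict

# Hypothetical log lines; the layout "date time LEVEL logger message" is an
# assumption for this sketch, not a format taken from the thesis.
SAMPLE_LOGS = [
    "2014-11-02 10:15:01 ERROR nova.compute.manager Instance failed to spawn",
    "2014-11-02 10:15:02 INFO nova.scheduler.manager Host selected",
    "2014-11-02 10:15:03 ERROR cinder.scheduler.manager No valid host",
    "2014-11-02 10:15:04 WARNING nova.compute.manager Disk usage high",
]


def map_phase(line):
    """Map step: emit one (level, 1) pair per well-formed log line."""
    fields = line.split()
    if len(fields) >= 3:
        yield fields[2], 1  # the third field is the severity level


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_job(SAMPLE_LOGS)
print(counts)  # {'ERROR': 2, 'INFO': 1, 'WARNING': 1}
```

In a real Hadoop job the map and reduce functions would run in parallel on different
nodes and the framework would perform the shuffle; here everything runs in one process
to keep the data flow visible.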
With such a variety of cloud and big data providers, consumers may run many workloads
across an inter-cloud environment. Managing applications in this environment, and
monitoring faults in particular, is challenging because of the increasing complexity
and diversity of these systems. Because an inter-cloud environment centralizes many
different services, it needs a large number of system administrators and supporting
systems to manage the faults that occur in its systems and services. A supporting
system that can manage and analyse faults is therefore needed.
In this thesis, we propose an approach for monitoring and analysing faults in the
inter-cloud environment. The approach uses open source technologies to monitor and
correlate service logs across cloud systems. The contribution is twofold:
1. Studying faults and the existing techniques and tools for fault management in cloud
systems. We also study OpenStack and Hadoop, their components, and their ecosystems to
understand the complexity of the inter-cloud environment.
2. Deploying and integrating several open source tools for monitoring and analysing
faults. In particular, we collect and process service logs in an inter-cloud environment
that includes OpenStack and Hadoop components.
The rest of the thesis is structured as follows. Chapter 2 reviews the literature on
faults and surveys the tools and techniques for fault management in single-cloud and
inter-cloud environments. Chapter 3 presents the proposal of the thesis: approaches for
monitoring and analysing faults in the inter-cloud environment, together with the
system architecture and the communication between components. Chapter 4 describes
experiments on monitoring and analysing faults in inter-cloud systems. Chapter 5
concludes the thesis with a short discussion of ongoing work. Finally, Appendix A
gives the details of the setup and configuration used in the thesis.
Chapter 2
Literature Review
2.1 Fault
Faults in cloud computing and Hadoop have attracted considerable research. Avizienis et
al. [1] present the basic concepts and definitions associated with system dependability.
According to that study, a failure (or service failure) is an event in which the
delivered service deviates from correct service. An error is the part of the total state
of the system that may lead to a service failure. The adjudged or hypothesized cause of
an error is a fault. All faults that may affect a system during its life are classified
according to eight basic viewpoints, as shown in Figure 2-1. Since each viewpoint has
two values, there would be 2^8 = 256 different combined fault classes if all
combinations were possible.
Figure 2-1: A taxonomy of faults [1]
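The count of 256 can be checked by enumeration. The sketch below builds every
combination of one value per viewpoint. The viewpoint and value labels are our reading
of the taxonomy in [1] and should be checked against Figure 2-1; the names VIEWPOINTS
and combined_fault_classes are hypothetical. As the text notes, 256 is an upper bound,
reached only if every combination were possible.

```python
from itertools import product

# Eight viewpoints with two values each. The labels are our reading of [1]
# and are assumptions of this sketch, not text quoted from the thesis.
VIEWPOINTS = {
    "phase of creation": ("development", "operational"),
    "system boundaries": ("internal", "external"),
    "phenomenological cause": ("natural", "human-made"),
    "dimension": ("hardware", "software"),
    "objective": ("malicious", "non-malicious"),
    "intent": ("deliberate", "non-deliberate"),
    "capability": ("accidental", "incompetence"),
    "persistence": ("permanent", "transient"),
}


def combined_fault_classes():
    """Return every combination of one value per viewpoint as a dict."""
    names = list(VIEWPOINTS)
    return [dict(zip(names, values)) for values in product(*VIEWPOINTS.values())]


classes = combined_fault_classes()
print(len(classes))  # 256, i.e. 2 ** 8
```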
According to this study [1], the major techniques for handling faults can be grouped
into:
∙ fault prevention: means to prevent the occurrence or introduction of faults;
∙ fault tolerance: means to avoid service failures in the presence of faults;
∙ fault removal: means to reduce the number and severity of faults;
∙ fault forecasting: means to estimate the present number, the future incidence, and
the likely consequences of faults.
The survey by Salfner et al. [2] presents a variety of online failure prediction
methods. It develops a taxonomy of failure prediction approaches that are based on
runtime monitoring and that use the current state of a system together with past
experience. As shown in Figure 2-2, the taxonomy is split vertically into four major
branches according to the type of input data used: data from failure tracking, symptom
monitoring, detected error reporting, and undetected error auditing. Each major branch
is further divided vertically into principal approaches, and each principal approach is
then divided horizontally into categories that group the surveyed methods.
