PhD thesis: Distributed solutions in privacy preserving data mining

Discussion in 'THẠC SĨ - TIẾN SĨ' started by Phí Lan Dương, 23/11/13.

    Doctoral dissertation, 2011
    Topic: Distributed solutions in privacy preserving data mining
    Vietnamese title: Research on building solutions to ensure information security in the data mining process



    Contents
    1 INTRODUCTION 1
    1.1 Privacy-preserving data mining: An overview 1
    1.2 Objectives and contributions 5
    1.3 Related works 7
    1.4 Organization of thesis 12
    2 METHODS FOR SECURE MULTI-PARTY COMPUTATION 13
    2.1 Definitions 13
    2.1.1 Computational indistinguishability 13
    2.1.2 Secure multi-party computation 14
    2.2 Secure computation 15
    2.2.1 Secret sharing 15
    2.2.2 Secure sum computation 16
    2.2.3 Probabilistic public key cryptosystems 17
    2.2.4 Variant ElGamal Cryptosystem 18
    2.2.5 Oblivious polynomial evaluation 20
    2.2.6 Secure scalar product computation 21
    2.2.7 Privately computing ln a 22
    3 PRIVACY PRESERVING FREQUENCY-BASED LEARNING IN 2PFD SETTING 24
    3.1 Introduction 24
    3.2 Privacy preserving frequency mining in 2PFD setting 27
    3.2.1 Problem formulation 27
    3.2.2 Definition of privacy 29
    3.2.3 Frequency mining protocol 30
    3.2.4 Correctness Analysis 32
    3.2.5 Privacy Analysis 34
    3.2.6 Efficiency of frequency mining protocol 37
    3.3 Privacy Preserving Frequency-based Learning in 2PFD Setting 38
    3.3.1 Naive Bayes learning problem in 2PFD setting 38
    3.3.2 Naive Bayes learning Protocol 40
    3.3.3 Correctness and privacy analysis 42
    3.3.4 Efficiency of naive Bayes learning protocol 42
    3.4 An improvement of frequency mining protocol 44
    3.4.1 Improved frequency mining protocol 44
    3.4.2 Protocol Analysis 45
    3.5 Conclusion 46
    4 ENHANCING PRIVACY FOR FREQUENT ITEMSET MINING IN VERTICALLY DISTRIBUTED DATA 49
    4.1 Introduction 49
    4.2 Problem formulation 51
    4.2.1 Association rules and frequent itemset 51
    4.2.2 Frequent itemset identifying in vertically distributed data 52
    4.3 Computational and privacy model 53
    4.4 Support count preserving protocol 54
    4.4.1 Overview 54
    4.4.2 Protocol design 56
    4.4.3 Correctness Analysis 57
    4.4.4 Privacy Analysis 59
    4.4.5 Performance analysis 61
    4.5 Support count computation-based protocol 64
    4.5.1 Overview 64
    4.5.2 Protocol Design 65
    4.5.3 Correctness Analysis 65
    4.5.4 Privacy Analysis 67
    4.5.5 Performance analysis 68
    4.6 Using binary tree communication structure 69
    4.7 Privacy-preserving distributed Apriori algorithm 70
    4.8 Conclusion 71
    5 PRIVACY PRESERVING CLUSTERING 73
    5.1 Introduction 73
    5.2 Problem statement 74
    5.3 Privacy preserving clustering for the multi-party distributed data 76
    5.3.1 Overview 76
    5.3.2 Private multi-party mean computation 78
    5.3.3 Privacy preserving multi-party clustering protocol 80
    5.4 Privacy preserving clustering without disclosing cluster centers 82
    5.4.1 Overview 83
    5.4.2 Privacy preserving two-party clustering protocol 85
    5.4.3 Secure mean sharing 87
    5.5 Conclusion 88
    6 PRIVACY PRESERVING OUTLIER DETECTION 91
    6.1 Introduction 91
    6.2 Technical preliminaries 92
    6.2.1 Problem statement 92
    6.2.2 Linear transformation 93
    6.2.3 Privacy model 94
    6.2.4 Private matrix product sharing 95
    6.3 Protocols for the horizontally distributed data 95
    6.3.1 Two-party protocol 97
    6.3.2 Multi-party protocol 100
    6.4 Protocol for two-party vertically distributed data 101
    6.5 Experiments 104
    6.6 Conclusions 106
    SUMMARY 107
    Publication List 110
    Bibliography 111



    Chapter 1 INTRODUCTION
    1.1. Privacy-preserving data mining: An overview
    Data mining plays an important role in the current world and provides us with a powerful tool to efficiently discover valuable information from large databases [25]. However, the process of mining data can result in a violation of privacy; therefore, issues of privacy preservation in data mining are receiving more and more attention from the research community [52]. As a result, a large number of studies have been produced on the topic of privacy-preserving data mining (PPDM) [72]. These studies deal with the problem of learning data mining models from databases while protecting data privacy at the level of individual records or at the level of organizations.
    Basically, there are three major problems in PPDM [8]. First, organizations such as government agencies wish to publish their data for researchers and even for the community; however, they want to preserve data privacy, for example for highly sensitive financial and health data. Second, a group of organizations (or parties) wishes to jointly obtain the mining result on their combined data without disclosing any party's private information. Third, a miner wishes to collect data or obtain data mining models from individual users while preserving the privacy of each user. Consequently, PPDM can be divided into the following three areas, depending on the model of information sharing.
    Privacy-preserving data publishing: The model in this line of research consists of only one organization, which is the trusted data holder. This organization wishes to publish its data to the miner or the research community such that the anonymized data remain useful for data mining applications. For example, some hospitals collect records from their patients in order to provide required medical services. These hospitals can be trusted data holders; however, the patients may not trust the hospital when it sends their data to the miner.
    There is much evidence that publishing anonymized data can still lead to privacy breaches via certain attacks. One of them is called re-identification, as shown in [61]: for example, 87% of the American population have characteristics that allow them to be identified uniquely from a few public attributes, namely ZIP code, date of birth, and sex. Consequently, privacy-preserving data publishing has received much attention in recent years; it aims to prevent re-identification attacks while keeping the released data useful for data mining applications.
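    As a concrete illustration of such a linkage attack, the following minimal sketch (all records, names, and field names are invented for the example) joins a published table, from which explicit identifiers have been removed, with a public list on the quasi-identifiers ZIP code, date of birth, and sex; any unique match re-identifies the corresponding individual.

```python
# Minimal sketch of a re-identification (linkage) attack on published data.
# All records and field names are hypothetical illustrations.

published = [   # "anonymized" medical records: names removed, quasi-identifiers kept
    {"zip": "13053", "dob": "1965-03-12", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "13068", "dob": "1971-07-30", "sex": "M", "diagnosis": "flu"},
]
public_list = [  # external source that still carries identities
    {"name": "Alice", "zip": "13053", "dob": "1965-03-12", "sex": "F"},
    {"name": "Bob",   "zip": "13068", "dob": "1971-07-30", "sex": "M"},
]
QI = ("zip", "dob", "sex")  # quasi-identifier attributes used for the join


def reidentify(published, public_list):
    """Link each published record to the public identities that share its quasi-identifiers."""
    index = {}
    for person in public_list:
        index.setdefault(tuple(person[a] for a in QI), []).append(person["name"])
    for record in published:
        matches = index.get(tuple(record[a] for a in QI), [])
        if len(matches) == 1:  # a unique match reveals this person's diagnosis
            print(f"{matches[0]} -> {record['diagnosis']}")


reidentify(published, public_list)
```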
    A general technique for this purpose is k-anonymity [51], [82], [6]; its goal is to give the released data a property that protects it against re-identification. Consider a private data table from which explicit identifiers (e.g., SSN and Name) have been removed. Values of other released attributes, such as ZIP code, date of birth, marital status, and sex, can also appear in external sources that may still be joined with individual users' identities. If some combination of values of these attributes occurs uniquely or rarely, then by observing the data one can determine the identity of a user or narrow it down to a small set of candidates. The goal of k-anonymity is that every tuple in the released private table is indistinguishable from at least k - 1 other tuples.
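    Operationally, the k-anonymity condition can be checked by counting how often each combination of quasi-identifier values occurs in the released table and requiring every count to be at least k. The helper below is a minimal sketch under that reading (it is not code from the thesis, and the toy table is invented).

```python
from collections import Counter


def is_k_anonymous(table, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times,
    i.e. each tuple is indistinguishable from at least k-1 other tuples."""
    counts = Counter(tuple(row[a] for a in quasi_identifiers) for row in table)
    return all(c >= k for c in counts.values())


# Toy released table in which ZIP code and date of birth have been generalized.
released = [
    {"zip": "130**", "dob": "1965", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "130**", "dob": "1965", "sex": "F", "diagnosis": "asthma"},
    {"zip": "130**", "dob": "1971", "sex": "M", "diagnosis": "flu"},
    {"zip": "130**", "dob": "1971", "sex": "M", "diagnosis": "diabetes"},
]
print(is_k_anonymous(released, ("zip", "dob", "sex"), k=2))  # True
```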
    Privacy-preserving distributed data mining: This research area aims to develop distributed data mining algorithms that do not access the original data [33, 79, 35, 68, 80, 40]. Unlike privacy-preserving data publishing, each study in privacy-preserving distributed data mining typically addresses a specific data mining task. The model in this area usually consists of several parties, each holding one private data set. The general purpose is to enable the parties to mine cooperatively on their joint data sets without revealing private information to the other participating parties. Here, the way the data is distributed among the parties also plays an important role in how the problem is solved. Generally, the data can be partitioned among the parties either vertically or horizontally.
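    The table of contents above lists secure sum computation (Section 2.2.2) among the building blocks for such cooperative mining. As a minimal illustration of that idea, the sketch below simulates the textbook secure sum protocol in a single process: the initiating party masks its private value with a random number, each remaining party adds its own value to the running total, and the initiator removes the mask at the end, so no party learns another party's input (variable names are illustrative only).

```python
import random


def secure_sum(private_values, modulus=2**32):
    """Textbook secure-sum protocol, simulated in one process: party 0 adds a
    random mask, the masked total travels around the ring of parties, and
    party 0 subtracts the mask at the end. Every intermediate value a party
    sees is masked by a random number it does not know."""
    mask = random.randrange(modulus)               # known only to party 0
    running = (private_values[0] + mask) % modulus
    for v in private_values[1:]:                   # each remaining party adds its input
        running = (running + v) % modulus
    return (running - mask) % modulus              # party 0 removes the mask


# Three parties jointly compute 10 + 20 + 12 without revealing their individual inputs.
print(secure_sum([10, 20, 12]))  # 42
```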
    In a horizontal distribution, a data set is distributed over several parties, and every party holds data records with the same set of attributes. An example is the union of the customer databases of different banks. Typically, banks offer different services to their clients, such as savings accounts, credit cards, and stock investments. Assume the banks wish to predict which customers are safe, which are risky, and which may be fraudulent. Gathering all kinds of financial data about their customers and transactions can help them make these predictions and thus avoid large financial losses. Mining the gathered data with suitable techniques can generalize over these data sets and identify possible risks for future cases or transactions. More specifically, suppose a customer, Nam, goes to Bank A to apply for a loan to buy a car and provides the necessary information to Bank A. An expert system of Bank A can use the k-NN algorithm to classify Nam as either a risky or a safe customer. If this system uses only the database of Bank A, it could happen that Bank A does not have enough customers similar to Nam, so the system may produce a wrong classification: for example, Nam may in fact be a safe customer, but the system labels him as risky, and Bank A loses profit. Clearly, mining a larger database can yield a more accurate classification, so classification on the joint databases of Bank A and other banks might give a more accurate result, and Nam could have been classified as safe. However, privacy restrictions do not allow the banks to access each other's databases, and this is exactly where privacy-preserving distributed data mining helps. This scenario is a typical case of so-called horizontally partitioned data: the banks do not need to reveal their databases to each other, yet they can still apply k-NN classification to their joint databases while preserving each bank's private information.
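    To make the horizontally partitioned setting concrete, the non-private sketch below (with invented toy records) simply pools the banks' record sets, which share the same attributes, and runs a plain k-NN majority vote; the privacy-preserving protocols studied in this thesis aim to reach the same classification without any bank handing its rows to the others.

```python
import math

# Hypothetical horizontally partitioned data: every bank holds complete
# records over the same attributes (income, debt) plus a class label.
bank_a = [((55.0, 10.0), "safe"), ((20.0, 30.0), "risky")]
bank_b = [((60.0, 8.0), "safe"), ((18.0, 35.0), "risky"), ((52.0, 12.0), "safe")]


def knn_classify(joint_data, query, k=3):
    """Plain (non-private) k-NN majority vote over the pooled records."""
    by_distance = sorted(joint_data, key=lambda record: math.dist(record[0], query))
    labels = [label for _, label in by_distance[:k]]
    return max(set(labels), key=labels.count)


# Classifying a new loan applicant on the union of both banks' databases.
print(knn_classify(bank_a + bank_b, query=(50.0, 15.0)))  # 'safe'
```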
    In a vertical distribution, a data set is distributed over several parties, and every party owns a vertical part of every record in the database (it holds the records for a subset of the attributes). For example, financial transaction



    Bibliography
    [1] Dakshi Agrawal and Charu C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 247-255. ACM, 2001.
    [2] Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. SIGMOD Rec., 29(2):439-450, 2000.
    [3] Rakesh Agrawal, Ramakrishnan Srikant, and Dilys Thomas. Privacy preserving OLAP. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 251-262. ACM, 2005.
    [4] Shipra Agrawal and Jayant R. Haritsa. A framework for high-accuracy privacy-preserving mining. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering, pages 193-204. IEEE Computer Society, 2005.
    [5] Fatemah A. Alqallaf, Kjell P. Konis, R. Douglas Martin, and Ruben H. Zamar. Scalable robust covariance and correlation estimates for data mining. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '02, pages 14-23. ACM, 2002.
    [6] Roberto J. Bayardo and Rakesh Agrawal. Data privacy through optimal k-anonymization. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering, pages 217-228. IEEE Computer Society, 2005.
    [7] Dan Boneh. The decision Diffie-Hellman problem. In Proceedings of the Third International Symposium on Algorithmic Number Theory, pages 48-63. Springer-Verlag, 1998.
    [8] Charu C. Aggarwal and Philip S. Yu. Privacy-Preserving Data Mining: Models and Algorithms. Springer, Boston, MA, United States, 2008.
    [9] David W. Cheung, Jiawei Han, Vincent T. Ng, Ada W. Fu, and Yongjian Fu. A fast distributed algorithm for mining association rules. In PDIS '96: Proceedings of the fourth international conference on Parallel and distributed information systems, pages 31-43. IEEE Computer Society, 1996.
    [10] Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya, Xiaodong Lin, and Michael Y. Zhu. Tools for privacy preserving distributed data mining. SIGKDD Explor. Newsl., 4(2):28-34, 2002.
    [11] Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya, Xiaodong Lin, and Michael Y. Zhu. Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations, 4, 2003.
    [12] Ivan Damgård and Mads Jurik. A generalisation, a simplification and some applications of Paillier's probabilistic public-key system. In Proceedings of PKC '01, LNCS series, pages 119-136. Springer-Verlag, 2001.
    [13] Elena Dasseni, Vassilios S. Verykios, Ahmed K. Elmagarmid, and Elisa Bertino. Hiding association rules by using confidence and support. In IHW '01: Proceedings of the 4th International Workshop on Information Hiding, pages 369-383, London, UK, 2001. Springer-Verlag.
    [14] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
     
