TABLE OF CONTENTS

Acknowledgements
Abstract
Table of contents
List of tables and figures
CHAPTER 1: Introduction
1.1. What is data mining?
1.2. Data mining versus query tools
1.3. Mining association rules
1.4. Outline of the thesis
CHAPTER 2: Mining association rules with weighted items
2.1. Introduction
2.2. Problem definition
CHAPTER 3: Mining association rules with adjustable interestingness
3.1. Interestingness and interesting itemsets
3.2. Interestingness constraints
3.3. Motivation behind interesting itemsets and adjustable interestingness
CHAPTER 4: Algorithm for mining association rules with adjustable interestingness (MARAI)
4.1. Motivation
4.2. Preliminaries
4.3. Basic properties of itemset-tidset pairs
4.4. MARAI: Algorithm design and implementation
4.5. Experimental evaluation
CHAPTER 5: Conclusion
References
Appendix

ABSTRACT

Over the last several years, the problem of efficiently generating large numbers of association rules has been an active research topic in the data mining community. Many different algorithms have been developed with promising results. There are currently two approaches to the association rule mining problem. The first is to mine the frequent itemsets regardless of their coefficients. The second is to assign weights to the items to reflect their importance to the users. However, both rely on a minimum support threshold, which can be confusing to work with. In practice, we may want to mine the best rules as far as we can tell, rather than those that satisfy a certain threshold, especially when this threshold is given by an equation. To overcome this problem, we introduce the concept of adjustable interestingness and propose a novel approach to mining association rules based on it.
Our algorithm works only with the most interesting rules, significantly reducing the search space by skipping many uninteresting itemsets and pruning, at an early stage, those that cannot generate interesting itemsets. The total time needed for mining is therefore substantially decreased.

CHAPTER 1
INTRODUCTION

In this chapter, we introduce the concept of data mining and explain why it is regarded as such an important development. We also give the background of mining association rules.

1.1. What is data mining?

There is some confusion about the exact meanings of the terms ‘data mining’ and ‘knowledge discovery in databases (KDD)’. At the first international KDD conference in Montreal in 1995, it was proposed that the term ‘KDD’ be used to describe the whole process of extracting knowledge from data. An official definition of KDD is: ‘the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data’ [2]. The knowledge that is discovered must be new and not obvious, and people must be able to use it for a particular purpose. It was also proposed that the term ‘data mining’ be used exclusively for the discovery stage of the KDD process. The KDD process as a whole comprises selection, preprocessing, transformation, data mining, and interpretation or evaluation. Data mining has received the most attention because it is the most significant and most time-consuming of these steps.

The sudden rise of interest in data mining can partly be explained by the following factors [2]:

1. In the 1980s, all major organizations built infrastructural databases containing data about their clients, competitors, and products. These databases form a potential gold mine; they contain gigabytes of data with much ‘hidden’ information that cannot easily be traced using SQL (Structured Query Language).
Data mining algorithms can find interesting regularities in databases, whereas SQL is just a query language: it only helps to find data under constraints that we already know.
2. As the use of networks continues to grow, it will become increasingly easy to connect databases. Thus, connecting a client’s file to a file with demographic data may lead to unexpected views on the spending patterns of certain population groups.
3. Over the past few years, machine-learning techniques have expanded enormously. Neural networks, genetic algorithms and other simple, generally applicable learning techniques often make it easier to find interesting connections in databases.
4. The client/server revolution gives the individual knowledge worker access to central information systems from a terminal on his or her desk.

1.2. Data mining versus query tools

What is the difference between data mining and a normal query environment? What can a data mining tool do that SQL cannot? It is important to realize that data mining tools are complementary to query tools. A data mining tool does not replace a query tool but offers many additional possibilities [2]. Suppose that we have a large file containing millions of records that describe customers’ purchases in a supermarket. There is a wealth of potentially useful knowledge that can be found by running ordinary queries, such as ‘Who bought butter and bread last week?’, ‘Is the profit of this month more than that of last month?’ and so on. There is, however, knowledge hidden in the databases that is much harder to find using SQL. Examples would be the answers to questions such as ‘What products were often purchased together?’ or ‘What are the subsequent purchases after buying a gas cooker?’. Of course, these questions could be answered using SQL, but proceeding in such a way could take days or months to solve the problem, while a data mining algorithm could find the answers automatically in

REFERENCES

[1] R. Agrawal, T.
Imielinski, and A. Swami, ‘Mining association rules between sets of items in large databases’, in Proc. of the ACM SIGMOD Conference on Management of Data, Washington D.C., May 1993.
[2] P. Adriaans, D. Zantinge, ‘Data mining’, Addison-Wesley, 1999.
[3] J. Han, M. Kamber, ‘Data Mining: Concepts and Techniques’, University of Illinois, 2002.
[4] U. Fayyad, S. Chaudhuri, P. Bradley, ‘Data mining and its role in database systems’, 1999.
[5] D. V. Thanh, P. T. Hoan, P. X. Hieu, N. T. Trung, ‘Khai phá luật kết hợp với độ hỗ trợ không giống nhau’ [Mining association rules with different supports], Conference of junior scientists of Vietnam Nat’l Univ. Hanoi, pages 475-483, 2002.
[6] C. H. Cai, ‘Mining association rules with weighted items’, Master’s thesis, Chinese University of Hong Kong, 1998.
[7] M. J. Zaki, C. J. Hsiao, ‘CHARM: An efficient algorithm for closed itemset mining’, 2002.
[8] L. A. Zadeh, ‘Fuzzy sets’, Informat. Control, 338-353, 1965.
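
Illustration for Section 1.2. The question ‘What products were often purchased together?’ can be made concrete with a small sketch. The Python below is a minimal illustration under stated assumptions and is not the algorithm developed in this thesis: the toy transactions, the item names and the helper pair_supports are all hypothetical. For every unordered pair of items, it computes the fraction of transactions that contain both items.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; the item names are illustrative only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]


def pair_supports(baskets):
    """Return, for each unordered item pair, the fraction of baskets containing it."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has a single canonical form (a, b) with a < b.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: count / n for pair, count in counts.items()}


supports = pair_supports(transactions)

# Pick the pair that is purchased together most often.
top_pair, top_support = max(supports.items(), key=lambda kv: kv[1])
print(top_pair, top_support)
```

On these four toy baskets, bread and butter occur together in two of the four transactions, so that pair has the highest support (0.5). Enumerating every pair in this way is feasible only for small data, because the number of candidate itemsets grows quickly with the number of items.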