1.4 What Kinds of Patterns Can Be Mined?
Concept description, including characterization and discrimination, is described in
Chapter 4.
1.4.2 Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in data.
There are many kinds of frequent patterns, including frequent itemsets, frequent
subsequences (also known as sequential patterns), and frequent substructures. A frequent
itemset typically refers to a set of items that often appear together in a transactional
data set (for example, milk and bread, which are frequently bought together in
grocery stores by many customers). A frequently occurring subsequence, such as the pattern
that customers tend to purchase first a laptop, followed by a digital camera, and then
a memory card, is a (frequent) sequential pattern. A substructure can refer to different
structural forms (e.g., graphs, trees, or lattices) that may be combined with itemsets
or subsequences. If a substructure occurs frequently, it is called a (frequent) structured
pattern. Mining frequent patterns leads to the discovery of interesting associations and
correlations within data.
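
To make the notion of a frequent itemset concrete, the following is a minimal Python sketch that counts how often each pair of items appears together and keeps the pairs that meet a minimum count. The transactions, the function name frequent_itemsets, and the threshold are hypothetical, chosen only for illustration; this brute-force count is not one of the mining algorithms presented later in the book.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (not data from the text).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs", "butter"},
]

def frequent_itemsets(transactions, size, min_count):
    """Return itemsets of the given size found in at least min_count transactions."""
    counts = Counter()
    for t in transactions:
        # Sort so that each itemset has one canonical tuple form.
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}

pairs = frequent_itemsets(transactions, size=2, min_count=3)
print(pairs)  # {('bread', 'milk'): 3}
```

Enumerating every combination is practical only for tiny data sets; it is used here because it mirrors the definition directly.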
Example 1.7 Association analysis. Suppose that, as a marketing manager at AllElectronics, you want
to know which items are frequently purchased together (i.e., within the same
transaction). An example of an association rule mined from the AllElectronics transactional database is
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%],
where X is a variable representing a customer. A confidence, or certainty, of 50%
means that if a customer buys a computer, there is a 50% chance that she will buy
software as well. A 1% support means that 1% of all the transactions under analysis
show that computer and software are purchased together. This association rule involves
a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a
single predicate are referred to as single-dimensional association rules. Dropping the
predicate notation, the rule can be written simply as “computer ⇒ software [1%, 50%].”
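
The support and confidence of the rule above can be computed directly from their definitions. The sketch below is illustrative only: the function name support_and_confidence and the 200 toy transactions are assumptions, constructed so that the result matches the 1% and 50% figures in the example.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Compute (support, confidence) of the rule antecedent => consequent.

    support    = fraction of all transactions containing both sides
    confidence = fraction of antecedent transactions that also contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical data: 200 transactions, 4 with a computer, 2 of those with software.
transactions = (
    [{"computer", "software"}] * 2
    + [{"computer"}] * 2
    + [{"printer"}] * 196
)

support, confidence = support_and_confidence(transactions, {"computer"}, {"software"})
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# support = 1%, confidence = 50%
```

Note that support is measured against all transactions, whereas confidence is measured only against those containing a computer, which is why the two values can differ so widely.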
Suppose, instead, that we are given the AllElectronics relational database related to
purchases. A data mining system may find association rules like
age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”)
[support = 2%, confidence = 60%].
The rule indicates that of the AllElectronics customers under study, 2% are 20 to 29 years
old with an income of $40,000 to $49,000 and have purchased a laptop (computer)
at AllElectronics. There is a 60% probability that a customer in this age and income
group will purchase a laptop. Note that this is an association involving more than one
attribute or predicate (i.e., age, income, and buys). Adopting the terminology used in
multidimensional databases, where each attribute is referred to as a dimension, the
above rule can be referred to as a multidimensional association rule.
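
A multidimensional rule can be evaluated in the same way, with each predicate tested against a different attribute of a customer record. The following Python sketch is a rough illustration under stated assumptions: the record layout, the function name rule_stats, the reading of “40K..49K” as 40,000 ≤ income < 50,000, and the 150 toy records are all hypothetical, arranged to reproduce the 2% support and 60% confidence of the rule above.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for a rule whose antecedent is a list of predicates.

    Each predicate is a function that takes one record and returns True or False.
    """
    matches_antecedent = [r for r in records if all(p(r) for p in antecedent)]
    matches_both = [r for r in matches_antecedent if consequent(r)]
    support = len(matches_both) / len(records) if records else 0.0
    confidence = len(matches_both) / len(matches_antecedent) if matches_antecedent else 0.0
    return support, confidence

# Predicates for age(X, "20..29") and income(X, "40K..49K").
# Assumption: the income range is read as 40,000 <= income < 50,000.
antecedent = [
    lambda r: 20 <= r["age"] <= 29,
    lambda r: 40_000 <= r["income"] < 50_000,
]
# Predicate for buys(X, "laptop").
consequent = lambda r: r["buys"] == "laptop"

# Hypothetical customer records: 5 in the target group, 3 of whom bought a laptop.
records = (
    [{"age": 25, "income": 45_000, "buys": "laptop"}] * 3
    + [{"age": 27, "income": 42_000, "buys": "printer"}] * 2
    + [{"age": 41, "income": 70_000, "buys": "laptop"}] * 145
)

support, confidence = rule_stats(records, antecedent, consequent)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# support = 2%, confidence = 60%
```

Representing each dimension as its own predicate keeps the evaluation independent of how many attributes the rule involves.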