Pattern Discovery 1

Patterns represent intrinsic and important properties of datasets. Pattern discovery is uncovering patterns (inherent regularities) from massive data sets

What products were often purchased together?
What are the subsequent purchases after buying an iPad?

Support, Confidence

;; item bought

Beer, Nut, Diaper  
Beer, Coffee, Diaper  
Beer, Diaper, Eggs  
Nuts, Eggs, Milk  
Nuts, Coffee, Diaper, Eggs, Milk

support s 는 전체 구매 중에서, 해당 아이템이 구매되었을 확률이다. min support 를 50% 로 한다면,

freq 1-itemset: Beer(60%), Nuts(60%), Diaper(80%), Eggs(60%)
freq 2-itemset: {Beer, Diaper} (60%)

confidence c 는 X 가 포함되어있을때 Y 까지 구매되었을 조건부 확률이다. c = sup(X ∪ Y) / sup(X)

association rule 은, min support 와 min confidence 를 정해놓고, 그 이상이 되는 rule X -> Y 를 찾는 것이다.

Beer -> Diaper (s:60%, c:100%)
Diaper -> Beer (s:60%, c:75%)

Closed Patterns

이렇게 패턴을 구하면, 다수개의 물품이 있는 transaction 에서는 너무 많은 패턴이 생긴다.

;; TDB1 

T1: {a1, ..., a50}  
T2: {a1, ..., a100}

min supoort 를 1 이라 하면 1-itemset, 2-itemset, .. 100-itemset 처럼 2^100 - 1 개의 sub pattern 이 생긴다.

이 문제를 해결하기 위해 lossless compression 인 closed pattern 이란 개념을 도입해 보자.

closed pattern: A pattern(itemset) X is closed if X is frequent, and there exists no super pattern Y ⊃ X, with the same support as X

위와 똑같은 TDB 에 대해 closed pattern 을 구하면

P1: "{a1, ..., a50}: 2" , P2: "{a1, ..., a100}: 1"

closed pattern 대신 max pattern 을 이용할 수 있다.

max pattern: A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X

정의를 보면 support 값을 신경쓰지 않는다. 위와 같은 TDB 에 대해 max-pattern 을 찾으면

P: "{a1, ..., a100}: 1

당연히 lossy compression 이다.

Downard Closure Property

Apriori property 라 부르기도 하는데, 핵심은 이렇다.

{beer, diaper, nuts} 가 frequent 하면 subitem 인 {beer diaper} 는 적어도 그 만큼은 frequent 해야한다는 것이다.

Apriori: Any subset of a frequent itemset must be frequent

따라서 S 의 어떤 subset 도 infrequent 하면 S 가 frequent 할 일이 없으므로, 가지치기 할 수 있다는 것이다.

Apriori pruning principle: if there is any itemset which is infreqent, its superset should not even be generated

The Apriori Algorithm

level-wise, candidate generation and test

initially, scan DB once to get frequent 1-itemset

repeat  
  generate length-(k+1) candidate itemsets from length-k frequent itemsets
  test the candidates against DB to find frequent (k+1) itemset
  Set k := k + 1

until no frequent or candidate set can be generated

apirori 알고리즘이 나온 이래로, 성능을 개선하기 위한 많은 기법들이 발표됬다.

(1) Reduce passes of transaction database scans
– partitioning
– dynamic itemset counting

(2) Shrink the number of candidates
– Hashing
– Pruning by support lower bounding
– Sampling

(3) Exploring special data structures
– Tree projection
– H-miner
– Hypercube decomposition

Partitioning

Theorem: Any itemset that is potentially frequent in TDB must be frequent in at least one of the partitions of TDB

어느 local TDB 에서도 frequent 하지 않으면 global TDB 에서도 frequent 하지 않기 때문에, global 에서 frequent 하려면 적어도 partitioning 된 local TDB 중 하나에서는 frequent 해야 한다.

따라서 scan 1 에서 partitining 하고 local pattern 을 찾고, scan 2 에서 global frequent pattern 을 구하면 된다. 딱 2번만 TDB 에 접근한다.