[Data Science] Frequent Pattern Mining

Motivation

Apriori 알고리즘 및 향상법에 대해 candidate-generation와 test 프로세스가 대부분의 시간을 잡아먹는다. (bottle neck)
그렇다면 candidate generation을 안 하면 되는 거 아닌가?

FP-Growth

Main Idea

Local frequent item을 가지고 더 긴 frequent pattern을 찾아내자..!
예를 들어, 'A'가 frequent pattern이라고 가정한다면, 'A'를 가지고 있는 모든 transaction을 DB|A라고 하자. 만약 DB|A에서 'B'가 local frequent pattern라면 'AB'는 frequent pattern이다. DB|AB에서 recursive하게 반복...
- 즉, 계속 길이를 growth 해나가며 일련의 frequent pattern을 구하는 방법이다.

Process

Global FP-tree를 만든다.
Divide and conquer 전략을 사용해 frequent pattern을 구한다.

Construct Global FP-tree from Database

다음과 같은 transaction이 있다고 가정하자.
Minimum support = 3

Transaction ID	Items	Frequent Items (frequency descending order)
1	{A, B, D, E}	{E, B, A, D}
2	{B, C, E}	{E, B}
3	{A, B, D, E}	{E, B, A, D}
4	{A, C, E}	{E, A}
5	{B, D, E}	{E, B, D}

DB를 한 번 스캔하여 frequent 1-itemsets을 찾는다.
- 위의 DB에서 frequent item은 A: 3, B: 4, D:3, E: 5이다.
Frequent 1-itemset을 frequency descending order로 정렬한다. 이를 F-list라고 한다.
- F-list = E-B-A-D
DB를 다시 스캔하여 각각의 transaction에 대해 frequent 1-item을 구한다.
- 위의 Frequent Items가 예시이다.
FP-tree를 만든다.

How to construct FP-tree

Transaction ID	Frequent Items (frequency descending order)
1	{E, B, A, D}
2	{E, B}
3	{E, B, A, D}
4	{E, A}
5	{E, B, D}

Transaction ID를 순차적으로 탐색하면서 FP-tree를 생성한다.
이때, root node를 { }로 설정한다.
Transaction 1

Transaction 2

Transaction 3

Transaction 4

Transaction 5

Transaction을 다 스캔하면 FP-tree가 완성된다. 새로운 order에 대해서는 branch를 만들어 tree를 구성한다.

Header Table

FP-tree 이외에도 header table을 만들어야 한다.
Head는 먼저 생긴 순서대로 설정되며 linked list로 구성된다.

FP-Growth - divide and conquer

앞에서 만든 FP-tree를 가지고 divede and conquer 방식을 이용해 frequent pattern을 구할 것이다.

Partitioning Target Frequent Patterns

General Idea: Frequent pattern은 F-list에 따라 disjoint subset으로 나뉠 수 있다.
F-list = E-B-A-D
위 F-list의 역순으로 transaction을 분류할 수 있다.
- T1: D를 포함하는 frequent pattern
- T2: A를 포함하는 frequent pattern (D는 포함하지 않음)
- T3: B를 포함하는 frequent pattern (A, D는 포함하지 않음)
- T4: E를 포함하는 frequent pattern (사실상 없음)

Constructing P-conditional Pattern Base

Conditional pattern base: 특정 아이템을 포함하는 transaction의 prefix path를 추출하여 만든 데이터 집합이다.
분류된 target을 기반으로 conditional pattern base를 만들어야 한다.

Header table에서 head node로 이동한다.
Global FP-tree에서 모든 head node의 prefix path와 횟수를 conditional pattern base에 저장한다.

Case: D

Item	cond. pattern base	Target
D	EBA: 2	T1

Item	cond. pattern base	Target
D	EBA: 2, EB: 1	T1

위와 같이 모든 아이템에 대해 conditional pattern base를 구해주면 다음과 같은 표가 완성된다.

Item	cond. pattern base	Target
E	{ }	T4
B	E: 4	T3
A	EB: 2, E: 1	T2
D	EBA: 2, EB: 1	T1

Getting Frequent Pattern from Conditional Pattern-base

Item	cond. pattern base	Target
E	{ }	T4
B	E: 4	T3
A	EB: 2, E: 1	T2
D	EBA: 2, EB: 1	T1

Conditional pattern base에서 minimum support가 적용된 FP-tree를 생성할 수 있다.
Minimum support = 3, 위의 표에서 item D의 경우 EBA가 2, EB가 1이다. 총 EB는 3, A는 2이다. 즉, A는 minimum support를 만족하지 못하므로 EB로만 FP-tree를 생성한다.

Single Prefix Case

Conditional pattern base에서 FP-tree를 생성했을 때, 위의 예시와 같이 path가 하나밖에 없는 경우가 존재한다. 이 경우를 single prefix case라고 한다.
이때, single prefix로 조합할 수 있는 모든 pattern + P가 frequent itemset이 된다.
D-conditional FP-tree에서 나올 수 있는 모든 frequent itemset은 D, ED, BD, EBD.이다

Recursion Case

Single prefix case와 같이 path의 수가 하나가 아닌 여러 개인 경우도 존재한다. 이 경우를 recursion case라고 한다.
D-conditional pattern base가 EB: 3, A: 3이라고 가정한다면, D-conditional FP-tree는 다음과 같다.

이 경우, F-list를 다시 만들고 conditional pattern base를 다시 구하여 frequent pattern을 찾아야 한다.

Conditional pattern base of 'BD': 만들 수 있는 frequent itemset - BD, BDE

Conditional pattern base of 'AD': 만들 수 있는 frequent itemset - AD

Conditional pattern base of 'ED': 만들 수 있는 frequent itemset - ED

All frequent pattern related to D: D, AD, BD, ED, BDE

Pseudo Code

procedure FP-Growth (Tree, A) {
	if Tree contains a single path P then
    	for each B = nodes combination in P do
        	pattern = B U A
            support = min (support of the nodes in B)
    else
    	for each ai in the header of Tree do
        	pattern B = ai U A;
            with support = ai.support
            construct conditional pattern base of B
            TreeB = construct conditional FP-tree of B
            if TreeB != EMPTY_SET then
            	call FP-Growh(TreeB, B)
}

Benefits

FP-tree Structure

Completness
- Loss 없이 모든 정보를 완전하게 저장한다.
- 어떤 transaction이든 long pattern을 무시하지 않는다.
Compactness
- Original DB를 유지할 필요가 없다. FP-tree에 모든 데이터를 다 저장할 수 있다. → 불필요한 아이템을 저장하지 않음.
- Item이 frequency descending order로 정리되어있다. Root에 가까울수록 frequent하므로 공유가 쉽다 (빠르게 접근 가능).
- 크기가 작기 때문에 메모리에 올리기 쉽다.

FP-Growth

Divide and conquer
- 작업을 쪼개서 효율적이다.
- Original huge DB를 작은 DB로 만들어 관리해서 효율적이다.
Other factors
- Candidate generation, test를 진행하지 않아 좋다..!
- FP-tree structure를 사용하여 메모리에서 관리하기 좋다.
- 두 번의 DB scan 이후 전체 DB를 관리할 필요가 없다.
  1. Counting the frequency of each item
  2. Building the global FP-tree

[Data Science] Frequent Pattern Mining - FP-Growth

Motivation

FP-Growth

Main Idea

Process

Construct Global FP-tree from Database

How to construct FP-tree

Header Table

FP-Growth - divide and conquer

Partitioning Target Frequent Patterns

Constructing P-conditional Pattern Base

Getting Frequent Pattern from Conditional Pattern-base

Single Prefix Case

Recursion Case

Pseudo Code

Benefits

FP-tree Structure

FP-Growth

티스토리툴바