Posts tagged 'datascience'

[Data Science] Classification - Evaluation

Lazy vs. Eager Learning There are two approaches to how a model learns. Eager learning: The data is split into training data and test data. The model is first trained on the training data and then applied to the test data. Most machine learning methods are eager learners. Lazy learning: The training data is simply stored with no further training (or only minimal training), and the model waits until test data arrives. Once test data is given, it is classified according to a fixed algorithm. Far less time is spent on training, but more time is spent on predicting. K-Near..

Data Science
· 2024. 4. 20.
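
The lazy/eager contrast in the excerpt above can be sketched in a few lines. This is only an illustration under assumptions: the class names `LazyKNN` and `EagerCentroid`, the choice of a k-nearest-neighbor vote as the lazy learner and a nearest-centroid rule as the eager learner, and the toy data are all made up here and are not taken from the post.

```python
import math
from collections import Counter


class LazyKNN:
    """Lazy learner: fit() only stores the data; the work happens in predict()."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)  # no model is built here
        return self

    def predict(self, x):
        # Scan every stored training point at prediction time.
        nearest = sorted(zip(self.X, self.y),
                         key=lambda pair: math.dist(pair[0], x))[:self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]


class EagerCentroid:
    """Eager learner: fit() compresses the data into one centroid per class."""

    def fit(self, X, y):
        groups = {}
        for xi, yi in zip(X, y):
            groups.setdefault(yi, []).append(xi)
        self.centroids = {
            label: tuple(sum(col) / len(points) for col in zip(*points))
            for label, points in groups.items()
        }
        return self

    def predict(self, x):
        # Only the small fitted model is consulted, not the training data.
        return min(self.centroids,
                   key=lambda label: math.dist(self.centroids[label], x))


# Hypothetical toy data: two well-separated clusters.
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
y = ["a", "a", "a", "b", "b", "b"]

knn = LazyKNN(k=3).fit(X, y)
centroid = EagerCentroid().fit(X, y)
```

Here `LazyKNN.fit` costs almost nothing while each `predict` call walks the whole training set, which matches the trade-off the excerpt describes; `EagerCentroid` pays its cost up front instead.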

[Data Science] Classification - Random Forest

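Random Forest A decision tree is a single tree, whereas a random forest grows many of them, often hundreds (500 is a common setting). What can one tree do on its own, anyway. In short, it is a technique that trains a large number of decision trees and gathers the strengths of the well-performing ones into a final model. Algorithm 1. Bootstrap sample (*Bootstrap: a technique that generates several similar datasets by resampling): duplicated records are fine, and a shuffled order is no problem. Even better. 2. Build trees on random feature sets. Lots of them. 3. Aggregation: with majority voting, the features that classify well across the models..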

Data Science
· 2024. 4. 17.
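
The three steps in the excerpt above (bootstrap sample, trees on random feature sets, majority voting) can be sketched roughly as follows. To stay short, this sketch makes several assumptions that are not in the post: each "tree" is a one-split decision stump, the forest has only 25 trees, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter


def bootstrap(rows, rng):
    """Step 1: sample len(rows) records with replacement (duplicates allowed)."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows, features):
    """Fit a one-split tree using only the given feature indices."""
    best = None
    for f in features:
        for t in sorted({x[f] for x, _ in rows}):
            left = [label for x, label in rows if x[f] <= t]
            right = [label for x, label in rows if x[f] > t]
            if not left or not right:
                continue
            l_label = Counter(left).most_common(1)[0][0]
            r_label = Counter(right).most_common(1)[0][0]
            errors = (sum(lab != l_label for lab in left)
                      + sum(lab != r_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_label, r_label)
    if best is None:
        # No usable split (e.g. one distinct value): always predict the majority.
        major = Counter(label for _, label in rows).most_common(1)[0][0]
        return (0, float("inf"), major, major)
    return best[1:]


def fit_forest(rows, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    all_features = list(range(len(rows[0][0])))
    forest = []
    for _ in range(n_trees):
        sample = bootstrap(rows, rng)                    # step 1
        features = rng.sample(all_features, n_features)  # step 2
        forest.append(fit_stump(sample, features))
    return forest


def predict_forest(forest, x):
    """Step 3: every tree votes and the majority label wins."""
    votes = Counter(l_label if x[f] <= t else r_label
                    for f, t, l_label, r_label in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (features, label).
rows = [((1.0, 2.0), "neg"), ((1.5, 1.8), "neg"), ((2.0, 2.2), "neg"),
        ((6.0, 7.0), "pos"), ((6.5, 6.8), "pos"), ((7.0, 7.2), "pos")]
forest = fit_forest(rows)
```

A full random forest would grow deep trees and usually re-draw the feature subset at every split; the stump version only shows where resampling, feature randomness, and voting enter.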

[Data Science] Classification - Decision Tree

Decision Tree A decision tree is a top-down model that repeatedly splits the data on features and assigns a class label at each leaf node. Algorithm Basic idea: greedy & recursive & divide and conquer. Stopping conditions: only one class label remains in the data, or no feature is left to partition on. 1. Start from an empty tree that has only a root node. 2. Split the data on the selected feature (partitioning). For each resulting partition: 3. If a stopping condition is met, make the prediction. 4. Otherwise, go back to step 2 and repeat the partitioning. Decis..

Data Science
· 2024. 4. 17.
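
The recursive procedure and its two stopping conditions from the excerpt above can be sketched as below. The excerpt does not say how the feature is selected, so this sketch assumes information gain (lowest remaining entropy) as the criterion; the function names and the tiny dataset are also hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, features):
    """rows: list of (feature_dict, label). Returns a label or a split node."""
    labels = [label for _, label in rows]
    if len(set(labels)) == 1:   # stop: only one class label remains
        return labels[0]
    if not features:            # stop: no feature left to partition on
        return majority(labels)

    def remaining_entropy(feature):
        groups = {}
        for x, label in rows:
            groups.setdefault(x[feature], []).append(label)
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

    # Greedy choice: lowest remaining entropy means highest information gain.
    best = min(features, key=remaining_entropy)
    rest = [f for f in features if f != best]
    children = {}
    for value in {x[best] for x, _ in rows}:
        subset = [(x, label) for x, label in rows if x[best] == value]
        children[value] = build_tree(subset, rest)  # recurse on each partition
    return (best, children, majority(labels))


def classify(tree, x):
    while isinstance(tree, tuple):
        feature, children, default = tree
        if x[feature] not in children:
            return default  # unseen value: fall back to the node's majority
        tree = children[x[feature]]
    return tree


# Hypothetical toy data.
rows = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = build_tree(rows, ["outlook", "windy"])
```

Each call handles one partition, which is the divide-and-conquer part; the `min` over features is the greedy part, since earlier choices are never revisited.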

[Data Science] Classification - Overview

Classification Classification predicts one of a set of clearly separated choices (categorical class labels, discrete or nominal). A model is trained on the training set and then classifies the test set. Regression differs from classification in that it predicts continuous values. For example, classification predicts whether tomorrow will be hot or cold, while regression predicts what tomorrow's temperature will be. Training data Each record consists of (features / attributes). Each class label is assumed to be one of several classes. Model Class..

Data Science
· 2024. 4. 17.
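
The hot/cold versus temperature example in the excerpt above comes down to the type of the output. A minimal sketch, assuming a made-up weather history and a simple nearest-neighbor rule for both tasks (neither is from the post):

```python
from collections import Counter

# Hypothetical history: (today's temperature, tomorrow's temperature, tomorrow's label).
history = [
    (30.0, 31.0, "hot"), (28.0, 29.0, "hot"), (27.0, 26.5, "hot"),
    (5.0, 6.0, "cold"), (3.0, 2.0, "cold"), (7.0, 8.0, "cold"),
]


def nearest(today, k=3):
    """The k past days whose temperature was closest to today's."""
    return sorted(history, key=lambda row: abs(row[0] - today))[:k]


def classify(today):
    """Classification: the output is one of a fixed set of labels."""
    labels = [label for _, _, label in nearest(today)]
    return Counter(labels).most_common(1)[0][0]


def regress(today):
    """Regression: the output is a continuous value."""
    temps = [tomorrow for _, tomorrow, _ in nearest(today)]
    return sum(temps) / len(temps)
```

Both functions look at the same neighbors; `classify` takes a vote over labels while `regress` averages numbers, so only the aggregation step differs.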

[Data Science] Frequent Pattern Mining - Association and Correlations

Mining Max pattern & Closed pattern Beyond finding frequent itemsets, finding max patterns and closed patterns can also be an important part of extracting information. For max patterns and closed patterns, see the earlier post. 2024.03.28 - [Data Science] - [Data Science] Frequent Pattern Mining Frequent pattern mining is the process of identifying and analyzing patterns, such as itemsets, subsequences, and substructures, that appear frequently in large amounts of data. Frequ..

Data Science
· 2024. 4. 15.

[Data Science] Frequent Pattern Mining - FP-Growth

Motivation In the Apriori algorithm and its improvements, the candidate generation-and-test process takes up most of the running time (the bottleneck). So why not just skip candidate generation? FP-Growth Main Idea Let's use local frequent items to find longer frequent patterns..! For example, suppose 'A' is a frequent pattern, and let DB|A denote the set of all transactions that contain 'A'. If 'B' is a local frequent pattern in DB|A, then 'AB' is a frequent pattern. Repeat recursively on DB|AB... In other words, the length keeps grow..

Data Science
· 2024. 4. 1.
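
The DB|A idea in the excerpt above can be sketched as a short recursion. Note the assumptions: this version projects plain transaction lists rather than building an FP-tree, so it shows only the pattern-growth principle and not the FP-Growth data structure itself; the function name `grow`, the alphabetical item order, and the sample transactions are hypothetical.

```python
from collections import Counter


def grow(db, min_support, prefix=()):
    """Yield (pattern, support) for each frequent pattern that extends `prefix`."""
    # Count local frequent items in the current (projected) database.
    counts = Counter(item for transaction in db for item in set(transaction))
    for item in sorted(counts):
        if counts[item] < min_support:
            continue
        pattern = prefix + (item,)
        yield pattern, counts[item]
        # DB|pattern: transactions containing `item`, keeping only the items
        # that come later in the order so each pattern is found exactly once.
        projected = [[i for i in transaction if i > item]
                     for transaction in db if item in transaction]
        yield from grow(projected, min_support, pattern)


# Hypothetical transactions.
db = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
patterns = dict(grow(db, min_support=3))
```

No candidate itemsets are generated and tested against the full database here: a longer pattern is reported only because its last item was counted as locally frequent inside the projected database of its prefix.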