Naive Bayesian Classifier + Bayesian statistics

2013. 2. 6. 16:51

Probabilistic relevance feedback

단어 k가 나타날 확률을 다음의 두 가지 확률의 합으로 결정된다.

관련 문서들 중에서 단어 k가 나타날 확률

관련 없는 문서들 중에서 단어 k가 나타날 확률

Bayesian Methods

Learning and classification methods based on probability theory

Build a generative model that approximates how data is produced.
Uses prior probability of each category given no information about an item.
Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Bayes' theorem

Evidence 혹은 관찰된 데이터 투플 X가 주어질 때, 어떤 특정한 가설 H가 성립할 확률 P(H|X)를 결정하기 원한다. 다시 말해서, 우리가 X의 속성서술(attribute description)을 알고 있을 때, 투플 X가 클래스 C에 속할 확률을 찾고자 한다.

For events A and B:

Prior: P(A)
Posterior: P(A|B)
Likelihood: P(B|A)
Evidence: P(B)
Odds:

Naive Bayesian Classifier

(classification: http://sens.tistory.com/288)

Classify a new instance X based on a (traning) tuple of attribute vector <x₁, x₂, …, x_n> into one of the classes c_j ∈ C (for 1 ≤ j ≤ m).

P(X) = P(x₁, x₂, …, x_n).
P(c_j) can be estimated from the frequency of classes in the training examples.
P(x₁, x₂, …, x_n|c_j) could only be estimated if a large number of training examples were available.
If the prior probability is unknown, we generally assume P(c₁) = P(c₂) = … = P(c_m) and maximize P(X|c_j). Otherwise, we maximize P(X|c_j)·P(c_j).

Naive Bayesian Classifier predicts that the instance(tuple) X is in the class c_MAP:

c_MAP is called maximum posteriori hypothesis. (MAP means maximum a posteriori.)
In mathematics, arg max stands for the argument of the maximum, that is to say, the set of points of the given argument for which the given function attains its maximum value:

P(X) is just a constant over all classes, which is ignored in arg max calculation.

Naive Bayes Conditional Independence Assumption

Given a training set with a number of attributes, the cost of calculation for P(x₁, x₂, …, x_n|c_j) is too high. To reduce the cost of calculation, we should assume a simple hypothesis:

Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(x_i|c_j).

Conditional Independence Assumption: features detect term presence and are independent of each other given the class:

This model is appropriate for binary variables. (Multivariate Bernoulli model)
If A_k is categorical, we attempt maximum likelihood estimates. (simply use the frequencies in the data)

else if A_k is continuous, the probability P(x_i|c_j) forms Gaussian distribution:

Smoothing to Avoid Overfitting

Problem with Max Likelihood:

If the amount of training tuples is insufficient, P(X|c_j) = 0 and then P(c_j|X) = 0.

Solution:

Using Laplacian correction: adding 1 to each case

Example:

Suppose a dataset with 1000 tuples, n_income=low = 0, n_{income=medium} = 990, n_income=high = 10,
Pr(income=low) = 1/1003,
Pr(income=medium) = 991/1003,
Pr(income=high) = 11/1003.

Examples

#1: 미국 대통령들의 연설문 분석

#2: buys_computer

c₁: buys_computer = yes

c₂: buys_computer = no

X = (age = youth, income = medium, student = yes, credit_rating = fair)

We should maximize P(X|c_j)·P(c_j) with j = 1, 2.

training set

age	income	student	credit_rating	buys_computer
~30	high	no	fair	no
~30	high	no	excellent	no
31~40	high	no	fair	yes
40~	medium	no	fair	yes
40~	low	yes	fair	yes
40~	low	yes	excellent	no
31~40	low	yes	excellent	yes
~30	medium	no	fair	no
~30	low	yes	fair	yes
40~	medium	yes	excellent	yes
~30	medium	yes	excellent	yes
31~40	medium	no	excellent	yes
31~40	high	yes	fair	yes
40~	medium	no	excellent	no

P(buys_computer = yes) = 9/14 = 0.643

P(buys_computer = no) = 5/14 = 0.357

P(age = youth | buys_computer = yes) = 2/9 = 0.222

P(age = youth | buys_computer = no) = 3/5 = 0.600

P(income = medium | buys_computer = yes) = 4/9 = 0.444

P(income = medium | buys_computer = no) = 2/5 = 0.400

P(student = yes | buys_computer = yes) = 6/9 = 0.667

P(student = yes | buys_computer = no) = 1/5 = 0.200

P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667

P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

P(X|buys_computer = yes)

= P(age = youth | buys_computer = yes) × P(income = medium | buys_computer = yes)

　× P(student = yes | buys_computer = yes) × P(credit_rating = fair | buys_computer = yes)

= 0.222 × 0.444 × 0.667 × 0.667 = 0.044

P(X|buys_computer = no)

= P(age = youth | buys_computer = no) × P(income = medium | buys_computer = no)

　× P(student = yes | buys_computer = no) × P(credit_rating = fair | buys_computer = no)

= 0.600 × 0.400 × 0.200 × 0.400 = 0.019

P(X|buys_computer = yes)·P(buys_computer = yes) = 0.044 × 0.643 = 0.028

P(X|buys_computer = no)·P(buys_computer = no) = 0.019 × 0.357 = 0.007

According to the above, the Naive Bayesian Classifier predicts that buys_computer = yes.

Time Complexity

Very efficient overall, linearly proportional to the time needed to just read in all the data.

Training Time:

L_d is the average length of a document in D.
Assume V and all D_i, n_i, and n_ij pre-computed in O(|D|L_d) time during one pass through all of the data.
Generally just O(|D|L_d) since usually |C||V| < |D|L_d

Test Time:

L_t is the average length of a test document

Underflow Prevention: log space

Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. Class with highest final un-normalized log probability scores is still the most probable.

Comments

Advantages
　- Easy to implement
　- Good results obtained in most of the cases
Disadvantages
　- Assumption: class conditional independence, therefore loss of accuracy
　- Practically, dependencies exist among variables.
　　(e.g. Symptoms: fever, cough, etc.)
How to deal with these dependencies?
　- Bayesian Belief Networks

Bayesian Belief Networks

Allows a subset of the variables conditionally independent.

A graphical model of causal relationships represents dependency among the variables.

Node: random variables
Link: dependency
X and Y are parents of Z, and Y is the parent of P
No dependency between Z and P
Has no loops or cycles

The conditional probability table(CPT)

CPT shows the conditional probability for each possible combination of its parents.

Example:

The conditional probability table(CPT) for variable LungCancer:

P(LungCancer=yes | FamilyHistory=yes, Smoker=yes) = 0.8

P(LungCancer=no | FamilyHistory=yes, Smoker=yes) = 0.2

P(LungCancer=yes | FamilyHistory=yes, Smoker=no) = 0.5

P(LungCancer=no | FamilyHistory=yes, Smoker=no) = 0.5

P(LungCancer=yes | FamilyHistory=no, Smoker=yes) = 0.7

P(LungCancer=no | FamilyHistory=no, Smoker=yes) = 0.3

P(LungCancer=yes | FamilyHistory=no, Smoker=no) = 0.1

P(LungCancer=no | FamilyHistory=no, Smoker=no) = 0.9

Derivation of the probability of a particular combination of values of X from CPT:

(y_j is the parent of x_i; 1 < j < m)

(참고) Bayesian statistics

통계적 분석은 '고전학파'와 '베이지안학파'로 나뉘어졌다고 볼 수 있다.

고전 확률(Classical probability)

a.k.a. 빈도주의(frequentism)
주장: 한 사건(event)의 확률(probability)은 그것이 일어나는 '빈도 수(frequency)'
대표학자: Ronald A. Fisher (1890-1962), Bradley Efron (1938- )

고전확률의 문제점

간단한 예로, 항아리에 검은 공과 하얀공이 들어 있고, 여기서 공을 하나씩 꺼내어 확인하는 과정을 생각해보자.

무한번 시행을 해야만 정확한 값을 알 수 있다는 것이다. 시행 횟수가 굉장히 적은 경우, 그 결과를 가지고 얘기할 수 있는 것은 거의 없다. 위의 예에서 한번 시행해서 하얀공이 나왔다고 해보자. 그러면 P=1 이라고 해야하는가? (사실은 그래서 시행 횟수가 적은 경우, 확률값이 어느 범위에 존재할 확률을 생각하여 해결하긴 한다.)
반복시행 자체를 할 수 없는 경우에도 우리는 일상적으로 확률의 개념을 사용하는데, 이를 Frequentist interpretation에서는 다룰 수 없다. 예를 들어 "다음 대선에 ○○○가 당선될 확률"을 생각해보자. "다음 대선"이라는 것은 역사상 딱 한 번만 일어나는 일인데다가 이 확률을 우리는 그 이전에 계산해내고 싶은 것이다. 하지만 우리는 이 경우에도 여러가지 관련된 사실을 이용하여 (여론조사 등을 이용해서) 이 확률을 생각해볼 수 있다.
완벽히 조건이 같은 독립적인 반복 시행은 원칙적으로 존재하지 않는다. 처음에 하얀공만 10번 나왔다가 그 후로 검은공만 나온다고 해보자. 10번 시행했을 때 우리는 P=1이라고 추정하지만 실제로는 P=0에 가깝다고 할 수 있겠다. 그럼 10번 시행했을때의 추정값 P=1을 어떻게 믿고 사용하겠는가? 물론 실제로 이런 일은 자주 벌어지지 않는다. 왜냐하면, 각각의 시행이 같은 조건에 이루어지며 서로 독립적이라는 가정이 보통의 상황에서는 성립하기 때문이다. 하지만, 엄밀히 말해서 각각의 시행은 시간 순서대로 일어나고, 나중의 시행의 결과는 전 시행의 결과를 안 이후의 조건부 결과이기때문에 원칙적으로는 독립이 아니다.

베이지안 확률(Bayesian probability)

a.k.a. 주관주의(subjectivism, subjective probability), 개인적확률(personal probability), 인식론적확률(epistemic probability), 논리적 확률(logical probability)
주장: 한 사건의 확률은 믿음(belief), 혹은 의지(willingness)에 기반하는 것
대표학자: Thomas Bayes (1702–1761), Leonard J. Savage (1917-1971)

베이지안 확률은 이런 문제점을 해결하기 위해 확률의 개념을 좀 더 확장한다. 반복 시행을 할 수 없는 명제나 상황이라 해도, 이것의 "불확실한 정도" 혹은 "믿음의 정도"를 확률로 할당하는 것이다. 즉, 확률을 불확실한 상황에서의 (Boolean) 진리값의 (이산 변수에서 연속 변수으로의) 확장으로 간주하는 것이다. (즉, 어떤 명제가 얼마나 진리에 가까운 것인지) 그런데 이렇게 객관적이지 않아 보이는 양을 어떻게 계산할 수 있을까? 바로 이 때 베이지안 이론이 필요하다.

베이지안 확률에서는 해당 명제를 판단할 증거(Evidence)가 나타나기 이전에도 확률(믿음의 정도)이 존재하고, 이를 사전확률(Prior Probability)이라고 한다. 전혀 아무런 지식이 없는 경우에도 이는 존재하지만(non-informative prior), 보통은 과거의 증거들을 종합한 결과가 될 것이다. 자, 이제 증거 D가 관찰된다. 이는 명제와 관련된 어떠한 관찰결과라도 상관없다. 해당 명제가 성립한다고 했을 때, 그 증거가 관찰될 확률 P(D|w) - 우도(Likelihood)만 알면 된다. 그리하여 우리는 여기서 증거를 관찰한 이후 그 명제의 확률(믿음의 정도)이 어떻게 변하는지를 알아낼 수 있다. 이를 사후확률(Posterior Probability)라고 한다. P(D)는 별로 중요한 양은 아니고, Posterior의 Normalization constant 정도로 생각하면 된다.

베이지안 통계

베이지안 통계에서 h(θ)는 'Θ의 사전확률분포(prior p.d.f. of Θ)'라 불리며, k(θ|y)는 'Θ의 사후확률분포(prior p.d.f. of Θ)'라 불린다. 그것은 h(θ)가 Y의 관측(observation) 이전의 Θ의 분포이고, k(θ|y)는 Y의 관측 이후의 Θ의 분포이기 때문이다.

사전확률분포(Prior)
많은 경우, 사전확률분포 h(θ)의 형태는 알고 있지 못하다.
따라서 분석가는 때로 자신이 알고 있는 모든 '사전(事前)' 지식을 동원하여 사전확률분포 h(θ)의 형태를 정하게 되는데, 결국 이는 사후확률분포 k(θ|y)의 성능에 영향을 미치게 된다. (따라서 혹자들은 사전확률을 개인적, 혹은 주관적 확률이라 부르기도 한다.)
사후확률분포(Posterior)
사후확률은 Y의 관측값 y에 대한 Θ의 조건부확률임을 확인할 수 있는데, 결국 y의 값은 주어진 값(given value)이기 때문에 사후확률은 θ에 대한 함수라는 것을 알 수 있다. (즉, y는 확률변수가 아님)

f₁(θ) = k(θ|y)

이와 더불어 또 한 가지 추론할 수 있는 아주 중요한 사실은, 뒤에서 언급할 'likelihood'의 선택에 따라 사후확률분포가 사전확률분포와 같은 분포를 따를 수 있다는 점이다. 이를테면, 사전확률분포가 이항분포로 정의되는 경우 사후확률도 이항분포를 따르고, 사후확률분포가 정규분포로 정의되는 경우 사후확률 역시 정규분포를 따를 수 있게 된다는 것이다. (자세한 내용은 'conjugate prior'에 대한 포스팅 참고)

우도(Likelihood)
일반적으로 조건부 확률에서는 P(B|A)라고 하면 'given A'라고 생각되지만, likelihood에서만큼은 예외다. 위의 식에서 볼 수 있듯 likelihood는 사후확률과 마찬가지로 θ에 대한 함수이며, 따라서 'given B'라고 할 수 있다. (따라서 확률의 공리적 정의를 따르지 않는다.)
증거(Evidence)
관례적으로 베이지안 통계에서는 그리 중요하게 다뤄지지 않았지만, 패턴인식의 떠오르는 샛별인 semi-supervised learning에서는 꽤 중요한 영역으로 다루어진다.

References

저작자표시 비영리 변경금지 (새창열림)

SENS

Programming Note