[Machine Learning] Topic 3: Decision Tree

Authored by Tony Feng

Created on May 11th, 2022

Last Modified on May 14th, 2022

Intro

A decision tree is a supervised machine learning algorithm. The goal is to create a model that predicts the value of a target variable by learning if-else conditions (decision rules) inferred from the data features.

Types of Decision Tree

Decision Trees (DT) can be used for both regression and classification problems, but they are mostly used for classification. Based on the type of target variable, we have 1) categorical-variable DTs and 2) continuous-variable DTs.

Non-linearity

Formally, a method is linear if, for an input $ x \in \mathbb{R}^n $ (with intercept term $ x_0 = 1 $), it only produces hypothesis functions $ h $ of the form:

$$ h(x) = \theta^{T}x, \quad \theta \in \mathbb{R}^n $$

Otherwise, it is non-linear. Decision trees, by contrast, can directly produce non-linear hypothesis functions without first having to devise an appropriate feature mapping.

Region Selection

In general, selecting optimal regions is intractable. Decision trees generate an approximate solution via greedy, top-down, recursive partitioning. Each node in the tree acts as a test case for some attribute, and each edge descending from that node corresponds to one of the possible answers to the test case.

Formally, given an input space $ X $, a parent region $ R_p $, a feature index $ j $, and a threshold $ t \in \mathbb{R} $, we obtain two child regions $ R_1 $ and $ R_2 $ as follows:

$$ R_{1} = \{X \mid X_{j} < t, X \in R_{p} \} $$

$$ R_{2} = \{X \mid X_{j} \geq t, X \in R_{p} \} $$

We can continue in this manner until we meet a pre-defined stopping criterion, and then predict the majority class at each leaf node.
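
To make the procedure concrete, here is a minimal Python sketch of one greedy split, assuming integer class labels and a numeric feature matrix; `misclass_error`, `weighted_impurity`, and `best_split` are illustrative helper names, not part of any particular library:

```python
import numpy as np

def misclass_error(y):
    """Fraction of samples outside the majority class (a simple impurity measure)."""
    counts = np.bincount(y)
    return 1.0 - counts.max() / counts.sum()

def weighted_impurity(y_left, y_right, impurity):
    """Size-weighted average impurity of the two child regions."""
    n = len(y_left) + len(y_right)
    return (len(y_left) * impurity(y_left) + len(y_right) * impurity(y_right)) / n

def best_split(X, y, impurity=misclass_error):
    """Greedily pick the (feature j, threshold t) with the lowest child impurity."""
    best_j, best_t, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = weighted_impurity(left, right, impurity)
            if score < best_score:
                best_j, best_t, best_score = j, t, score
    return best_j, best_t
```

Recursively applying `best_split` to the resulting $ R_1 $ and $ R_2 $ until a stopping criterion is met yields the full tree.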


Attribute Selection Measures

An attribute selection measure (ASM) is a technique used in the data mining process for data reduction. It is a heuristic for choosing the splitting test that “best” separates a given data partition. The three main ASM techniques are 1) Information Gain, 2) Gain Ratio, and 3) Gini Index.

Information Gain

Entropy measures the level of impurity in a group of examples. The concept comes from information theory: the higher the entropy, the more information content. Now, consider a dataset with $n$ classes; then

$$ E = -\sum_{i=1}^{n} p_{i} \log_{2} (p_{i}) $$

where $ p_i $ is the probability of randomly selecting an example in class $i$.

We can define information gain (IG) as a measure of how much information a feature provides about a class. It is the expected reduction in entropy caused by partitioning the examples according to a given attribute. Given a dataset $ S $, we can calculate the information gain $IG(S, A)$ of an attribute $A$ as:

$$ IG(S, A) = E(S) - \sum_{v \in V} \frac{|S_{v}|}{|S|} E(S_{v}) $$

where $V$ is the set of all possible values of attribute $A$, and $S_{v}$ is the subset of $S$ for which attribute $A$ has value $v$.
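
As a small sketch of these two formulas, assuming integer-encoded class labels and a categorical attribute column (helper names are illustrative):

```python
import numpy as np

def entropy(y):
    """E = -sum_i p_i * log2(p_i) over the classes present in y."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]                      # drop empty classes (0 * log 0 := 0)
    return -np.sum(p * np.log2(p))

def information_gain(a, y):
    """IG(S, A) = E(S) - sum_v |S_v|/|S| * E(S_v), where a is the attribute column."""
    ig = entropy(y)
    for v in np.unique(a):
        mask = (a == v)
        ig -= mask.mean() * entropy(y[mask])
    return ig
```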

Gain Ratio

Gain Ratio, or Uncertainty Coefficient, normalizes the information gain of an attribute by the entropy of that attribute itself. It lessens the bias of Information Gain toward highly branched predictors by introducing a normalizing term called the Intrinsic Information (II):

$$ II(S, A) = - \sum_{v \in V} \frac{|S_{v}|}{|S|} \log_2 \frac{|S_{v}|}{|S|} $$

The formula for the Gain Ratio is given by

$$ GR(S, A) = \frac{IG(S, A)}{II(S, A)} $$

From this formula, we can see that the fewer distinct values attribute $ A $ has, the smaller $ II(S, A) $ is, and thus the less the gain is penalized. Informally, the formula for Intrinsic Information has the same form as that of Entropy, but it measures the purity of the partition induced by attribute $ A $ itself.
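
Continuing the previous sketch (and reusing its `information_gain` helper), the Intrinsic Information and Gain Ratio can be computed as follows:

```python
import numpy as np

def intrinsic_information(a):
    """II(S, A): entropy of the partition induced by attribute A itself."""
    _, counts = np.unique(a, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(a, y):
    """GR(S, A) = IG(S, A) / II(S, A); returns 0 when A takes a single value."""
    ii = intrinsic_information(a)
    return information_gain(a, y) / ii if ii > 0 else 0.0
```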

Gini Index

Gini Index, also called Gini Impurity, measures the degree or probability of a particular variable being wrongly classified when it is randomly chosen. If we have $C$ total classes and $p_i$ is the probability of picking a datapoint with class $i$, then the Gini Impurity is calculated as

$$ GI=\sum_{i=1}^{C} p_i(1-p_i) = 1 - \sum_{i=1}^{C} p_i^{2}$$

For a data sample $ S $ and an attribute $ A $ taking values in $ V $, we therefore have

$$ GI(S, A) = \sum_{v \in V} \frac{|S_{v}|}{|S|} GI(S_{v}) $$

The Gini Index varies between $0$ and $1$, where $0$ represents a pure node (all elements belong to one class) and values close to $1$ indicate that elements are randomly distributed across many classes. For a binary problem, a Gini Index of $0.5$ means the elements are equally distributed between the two classes.
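
A corresponding sketch for the Gini Index of a node and of a split on a categorical attribute (again assuming integer-encoded labels; helper names are illustrative):

```python
import numpy as np

def gini(y):
    """GI = 1 - sum_i p_i^2 over the classes present in y."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def gini_of_split(a, y):
    """GI(S, A) = sum_v |S_v|/|S| * GI(S_v) for a categorical attribute column a."""
    return sum((a == v).mean() * gini(y[a == v]) for v in np.unique(a))
```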


Decision Tree Algorithms

ID3

The ID3 algorithm builds decision trees using a top-down greedy search approach through the space of possible branches with no backtracking.

Steps of ID3

  • Begin with the original dataset as the root node and calculate its entropy.
  • For each attribute/feature:
    • Calculate entropy for all its categorical values.
    • Calculate information gain for the feature.
  • Find the feature with maximum information gain and split to produce subsets.
  • Continue recursing on each subset until we get the desired tree (a toy sketch follows below).
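
The steps above can be turned into a toy ID3 implementation; this is only a sketch, assuming integer-encoded categorical features and labels, and reusing the `information_gain` helper from the Information Gain section:

```python
import numpy as np

def id3(X, y, features):
    """Recursively build a tree as nested dicts: {feature_index: {value: subtree_or_label}}."""
    if len(np.unique(y)) == 1:               # pure node -> leaf with that class
        return int(y[0])
    if not features:                         # no features left -> majority-class leaf
        return int(np.bincount(y).argmax())
    gains = [information_gain(X[:, j], y) for j in features]
    best = features[int(np.argmax(gains))]   # feature with maximum information gain
    remaining = [j for j in features if j != best]
    tree = {best: {}}
    for v in np.unique(X[:, best]):          # one branch per value of the chosen feature
        mask = X[:, best] == v
        tree[best][int(v)] = id3(X[mask], y[mask], remaining)
    return tree
```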

Disadvantages

  • ID3 has no pruning strategy and tends to overfit.
  • The IG criterion prefers features with a large number of possible values.
  • It can only handle discretely distributed (categorical) features.
  • It does not consider missing values.

C4.5

To address these shortcomings of ID3, C4.5 makes the following improvements:

  • C4.5 overcomes ID3’s emphasis on the number of feature values by using the Gain Ratio as the goodness function to split the dataset.
  • C4.5 is able to handle both continuous and discrete attributes.
  • C4.5 has a pruning strategy, which replaces unhelpful branches with leaf nodes after the tree is created.
  • C4.5 takes missing values into account.

Disadvantages

  • C4.5 uses multi-way splits, which are less efficient than a binary tree.
  • C4.5 can only be used for classification.
  • C4.5 is computationally expensive in both time and space.

CART

Classification And Regression Trees (CART) is a recursive partitioning method that builds binary trees for predicting continuous target variables (regression) and categorical target variables (classification).

The classification tree uses Gini impurity as the splitting criterion, while the regression tree uses squared error.
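
In practice, CART-style trees are usually built with a library rather than from scratch; a brief scikit-learn sketch (assuming a reasonably recent scikit-learn, where the regression criterion is named `"squared_error"`):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X, y = load_iris(return_X_y=True)

# Classification tree: splits chosen by Gini impurity (the default criterion).
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)

# Regression tree: splits chosen by squared error (predicting one feature from the others).
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0)
reg.fit(X[:, :3], X[:, 3])
```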


Tricks in Decision Tree Construction

Pruning

Pre-pruning (stop splitting when any of the following holds; a library-level example follows after the list):

  • The number of samples in a node is smaller than a threshold.
  • The depth of the tree exceeds a threshold.
  • All the attributes have been checked.
  • Accuracy on the validation set becomes lower after splitting.
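
These conditions map roughly onto common library hyperparameters; for example, in scikit-learn (the particular values below are arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop splitting when nodes are too small, the tree is too deep,
# or a split does not reduce impurity enough.
clf = DecisionTreeClassifier(
    min_samples_split=20,        # node size below the threshold -> do not split
    max_depth=5,                 # depth above the threshold -> stop growing
    min_impurity_decrease=1e-3,  # a split must improve purity by at least this much
)
```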

Post-pruning (Readings):

  • Reduced-Error Pruning
  • Pessimistic Error Pruning

Continuous Variables

Since some attributes are continuous, we handle their values in the following manner. Suppose attribute $ A $ takes the values $ a_1, a_2, \ldots, a_n $; then there are $ n - 1 $ candidate thresholds. The values are sorted in ascending order, and each threshold is calculated as

$$ threshold(k)=\frac{a_{k}+a_{k+1}}{2} $$

where $ a_{k} < a_{k+1} $ and $ 1 \leq k \leq n-1 $. The optimal split is the threshold that achieves the maximum purity.
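
A small sketch that enumerates these candidate thresholds for a continuous attribute column (the helper name is illustrative):

```python
import numpy as np

def candidate_thresholds(values):
    """Midpoints between consecutive sorted distinct values of a continuous attribute."""
    a = np.unique(values)          # sorted ascending, duplicates removed so a_k < a_{k+1}
    return (a[:-1] + a[1:]) / 2.0  # threshold(k) = (a_k + a_{k+1}) / 2

# e.g. candidate_thresholds([2.0, 1.0, 3.0]) -> array([1.5, 2.5])
```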

Missing Values

Case 1: When an attribute has missing values, we calculate the ASM without considering the samples with missing values. Then, we scale the ASM down according to the proportion of data with missing values.

Case 2: We have found a split criterion, but some samples lack a value for that attribute. We distribute those samples into the sub-nodes in proportion to the sizes of the sub-nodes formed by the normal data.

Case 3: When the testing data has missing values, let each such sample go through every branch and see which branch results in the highest probability.
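
As an illustration of Case 2 above, samples with a missing attribute value can be sent to both children with fractional weights proportional to each child’s share of the normally routed data; this is only a sketch with a hypothetical helper, not the exact C4.5 procedure:

```python
import numpy as np

def split_with_missing(a, weights, threshold):
    """Per-child sample weights when the attribute column `a` may contain NaN."""
    known = ~np.isnan(a)
    left_mask = known & (a < threshold)
    right_mask = known & (a >= threshold)
    left_share = weights[left_mask].sum() / weights[known].sum()
    # Samples with known values keep their weight in one child; missing-value
    # samples are split across both children in proportion to the child sizes.
    left_w = np.where(left_mask, weights, np.where(known, 0.0, weights * left_share))
    right_w = np.where(right_mask, weights, np.where(known, 0.0, weights * (1.0 - left_share)))
    return left_w, right_w
```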

