A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions. It is a supervised learning algorithm that can be used for both classification and regression problems, though it is most often used for classification. A decision tree is a flowchart-like tree structure in which each internal node represents a feature of the dataset, each branch represents a decision rule, and each leaf node represents an outcome.
The purpose of a decision tree is to create a training model that can be used to predict the class or value of a target variable by learning simple decision rules inferred from the training data.
Decision Tree Terminology
Root node: The node where the decision tree starts. It represents the entire dataset (population), which then gets divided into two or more homogeneous sets.
Leaf node: Final output nodes are called leaf nodes; the tree cannot be split further once a leaf node is reached.
Splitting: The process of dividing a node into two or more sub-nodes.
Branch/Sub-tree: A subtree of the main tree is called a branch or sub-tree.
Pruning: The process of removing unwanted branches from the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called children of the parent node.
How the Decision Tree Algorithm Works
Decision trees use various algorithms to choose the root node and to split a node into sub-nodes; for example, the ID3 (Iterative Dichotomiser 3) algorithm is used to decide the root node and to split nodes into sub-nodes.
To understand the ID3 algorithm, we first need to know two concepts: Entropy (H) and Information Gain (G).
Entropy
In data science, entropy is used to measure how “mixed” a column is; in other words, it measures disorder. Through entropy, we can judge how well partitioning a dataset separates the target variable, where partitioning means grouping the rows by the values of a feature and looking at the target values within each group.
The formula for entropy:
H(S) = −(P+ log2(P+) + P− log2(P−))
where:
H(S) = entropy of the current dataset S
P+ = probability of the positive (“Yes”) class in S
P− = probability of the negative (“No”) class in S
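As a quick illustration, here is a minimal sketch (not part of the original tutorial) of computing entropy for a column of labels with pandas and NumPy:

import numpy as np
import pandas as pd

def entropy(labels):
    # H(S) = -(sum over classes of p * log2(p))
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

# A column with 4 "No" and 3 "Yes" labels gives H(S) ≈ 0.985
print(entropy(pd.Series(['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes'])))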
Information Gain
Information gain measures the change in entropy after a dataset is split on an attribute. It calculates how much information a feature provides about the class. The attribute with the highest information gain is selected as the root (and, recursively, as each subsequent split).
The formula for information gain:
Gain(S, F) = H(S) − Σ (v ∈ F) P(v) × H(S_v)
where:
S = the target dataset
F = the feature being evaluated for the split
v ∈ F = each possible value of feature F
P(v) = proportion of samples in S for which feature F has value v
H(S_v) = entropy of the subset S_v, i.e. the rows of S where F has value v
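Building on the entropy helper sketched above, information gain can be expressed in a few lines (again a minimal sketch, assuming a pandas DataFrame df with a categorical feature column and a target column):

def information_gain(df, feature, target):
    # Gain(S, F) = H(S) - sum over values v of F of P(v) * H(S_v)
    total_entropy = entropy(df[target])
    weighted_entropy = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return total_entropy - weighted_entropy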
Applications of Decision Trees
- Healthcare industries: A decision tree can tell whether a patient is suffering from a disease based on weight, sex, age, and other factors.
- Educational sector: Whether a student at a school, college, or university is eligible for a scholarship can be decided with a decision tree based on results, financial status, family income, etc.
- Banking sector: Whether a person is eligible for a loan can be decided with a decision tree based on salary, family members, financial status, etc.
Implementation of the example in Python (Jupyter Notebook)
Step 1: Read the dataset with pandas
import pandas as pd

# Load the dataset from a CSV file
dataset = pd.read_csv('data.csv')
dataset
| Day | Outlook | Temperature | Routine | Play |
| --- | --- | --- | --- | --- |
| Day1 | Sunny | Cold | Indoor | No |
| Day2 | Sunny | Warm | Outdoor | No |
| Day3 | Cloudy | Warm | Indoor | No |
| Day4 | Sunny | Warm | Indoor | No |
| Day5 | Cloudy | Cold | Indoor | Yes |
| Day6 | Cloudy | Cold | Outdoor | Yes |
| Day7 | Sunny | Cold | Outdoor | Yes |
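As a cross-check, the hypothetical entropy/information_gain helpers sketched earlier can be applied to this table to see which feature ID3 would choose as the root:

for col in ['Outlook', 'Temperature', 'Routine']:
    print(col, round(information_gain(dataset, col, 'Play'), 3))

# Approximate result: Outlook 0.128, Temperature 0.522, Routine 0.128,
# so ID3 would pick Temperature as the root for this toy dataset.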
Step 2: Check for missing values and handle them if present
dataset.isnull().sum()
# there are no missing values
# output
Day 0
Outlook 0
Temperature 0
Routine 0
Play 0
dtype: int64
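There are no missing values in this dataset, but if there were, two common options are sketched below (these lines are illustrative additions, not part of the original steps):

# Option 1: drop rows that contain any missing value (a no-op here)
dataset = dataset.dropna()

# Option 2: fill a column's missing values with its most frequent value
# dataset['Outlook'] = dataset['Outlook'].fillna(dataset['Outlook'].mode()[0])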
Step 3: Data preprocessing
Machine learning models cannot work with string values directly, so we need to convert the strings to numeric values; that is why preprocessing is required. There are many ways to preprocess data, one of which is label encoding with sklearn's LabelEncoder.
from sklearn.preprocessing import LabelEncoder

# Label-encode each string column (apply fits a fresh encoder per column)
x = dataset[['Outlook', 'Temperature', 'Routine']].apply(LabelEncoder().fit_transform)
x
| Outlook | Temperature | Routine |
| --- | --- | --- |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 0 | 0 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 1 |
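Because apply(LabelEncoder().fit_transform) discards the fitted encoders, the mapping from category to integer is not kept. One way to inspect it (a small sketch, not in the original code) is to refit an encoder per column:

for col in ['Outlook', 'Temperature', 'Routine']:
    le = LabelEncoder().fit(dataset[col])
    print(col, dict(zip(le.classes_, le.transform(le.classes_))))

# e.g. Outlook -> {'Cloudy': 0, 'Sunny': 1}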
Step 4: Create a DecisionTreeClassifier model and train it
from sklearn.tree import DecisionTreeClassifier

# Create the classifier and train it on the encoded features and target
model = DecisionTreeClassifier()
model.fit(x, dataset.Play)
# Output
DecisionTreeClassifier()
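Note that scikit-learn's DecisionTreeClassifier uses the Gini impurity by default; to mirror the entropy/information-gain discussion above, pass criterion='entropy'. The learned rules can also be printed with export_text (an optional sketch, not one of the original steps):

from sklearn.tree import export_text

# Train an entropy-based tree to match the ID3-style discussion
model_entropy = DecisionTreeClassifier(criterion='entropy').fit(x, dataset.Play)

# Print the learned decision rules as text
print(export_text(model_entropy, feature_names=['Outlook', 'Temperature', 'Routine']))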
Step 5: Predict using new data
import numpy as np

# 1 -> Sunny, 0 -> Cold, 1 -> Outdoor, according to the preprocessed table
x_test = np.array([[1, 0, 1]])
model.predict(x_test)[0]
# Output
'Yes'
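Because the model was fitted on a DataFrame with named columns, recent scikit-learn versions warn when a bare NumPy array is passed to predict. A warning-free variant (a usage sketch, not in the original) builds the test row as a DataFrame with matching column names:

x_test_df = pd.DataFrame([[1, 0, 1]], columns=['Outlook', 'Temperature', 'Routine'])
print(model.predict(x_test_df)[0])  # 'Yes'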