Decision Tree Machine Learning Algorithm with Python Code


A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions. It is a supervised learning algorithm that can be used for both classification and regression problems, though it is most often used for classification. A decision tree is a flowchart-like tree structure in which each internal node represents a feature of the dataset, each branch represents a decision rule, and each leaf node represents an outcome.


The purpose of a decision tree is to create a training model that can predict the class or value of a target variable by learning simple decision rules inferred from the training data.

Decision Tree Terminology

Root node: The node from which the decision tree starts. It represents the entire dataset, which then gets divided into two or more homogeneous sets.
Leaf node: Final output nodes are called leaf nodes; the tree cannot be split further after a leaf node.
Splitting: The process of dividing a node into two or more sub-nodes.
Branch/Sub-tree: A subtree of the main tree is called a branch or sub-tree.
Pruning: The process of removing unwanted branches from the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent of those sub-nodes, and the sub-nodes are called children of the parent node.

How the Decision Tree Algorithm Works

Decision trees use various algorithms to choose the root node and to split a node into sub-nodes, e.g. the ID3 (Iterative Dichotomiser 3) algorithm.
To understand the ID3 algorithm we need to know two concepts: Entropy (H) and Information Gain (G).

Entropy

In data science, entropy is used to measure how “mixed” a column is; specifically, it measures disorder. Through entropy, we can see how well partitioning a dataset separates out the target variable. Partitioning here means grouping the rows by the different values of a feature and looking at the target values within each group.

The formula for entropy:

H(S) = −P+ log2(P+) − P− log2(P−)

where:
H(S) = entropy of the current dataset S
P+ = probability of the positive (Yes) class in S
P− = probability of the negative (No) class in S
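
As a quick sketch, here is how this entropy calculation might look in Python. The entropy helper and the sample labels are illustrative, mirroring the Play column of the dataset used later in this post:

import numpy as np
import pandas as pd

def entropy(labels):
    """Entropy H(S) of a column of class labels (e.g. the Play column)."""
    probs = labels.value_counts(normalize=True)  # P(class) for each class
    return -np.sum(probs * np.log2(probs))

# 3 Yes / 4 No, mirroring the Play column of the dataset used later
play = pd.Series(['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes'])
print(entropy(play))  # ~0.985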

Information Gain

Information gain measures the change in entropy after the dataset is split on an attribute. It tells us how much information a feature gives us about the class. We select as the root node the attribute with the highest gain.

The formula for gain:

Gain(S, F) = H(S) − Σ(v ∈ F) P(v) × H(S_v)

where:
S = target dataset
F = feature
v ∈ F = each value of feature F
P(v) = proportion of examples in S where feature F has value v
S_v = subset of S where feature F has value v
H(S_v) = entropy of the subset S_v
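
Building on the entropy sketch above, a minimal gain computation might look like the following; information_gain is a hypothetical helper name, not part of any library:

def information_gain(df, feature, target='Play'):
    """Gain(S, F) = H(S) - sum over v of P(v) * H(S_v)."""
    total = entropy(df[target])  # H(S)
    # weighted average entropy of each subset S_v, weighted by P(v)
    weighted = sum((len(sub) / len(df)) * entropy(sub[target])
                   for _, sub in df.groupby(feature))
    return total - weighted

# e.g. information_gain(dataset, 'Outlook') gives roughly 0.128
# on the toy weather dataset used below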

Applications of Decision Trees

  1. Healthcare: A decision tree can indicate whether a patient is suffering from a disease based on weight, sex, age, and other factors.
  2. Education: Whether a student at a school, college, or university is eligible for a scholarship, based on results, financial status, family income, etc., can be decided with a decision tree.
  3. Banking: Whether a person is eligible for a loan, based on salary, family members, financial status, etc., can be decided with a decision tree.

Implementation of this example in Python (Jupyter Notebook)

Step 1: Read the dataset with pandas

import pandas as pd
dataset = pd.read_csv('data.csv')
dataset
Day     Outlook   Temperature   Routine   Play
Day1    Sunny     Cold          Indoor    No
Day2    Sunny     Warm          Outdoor   No
Day3    Cloudy    Warm          Indoor    No
Day4    Sunny     Warm          Indoor    No
Day5    Cloudy    Cold          Indoor    Yes
Day6    Cloudy    Cold          Outdoor   Yes
Day7    Sunny     Cold          Outdoor   Yes

Step 2: Check for missing values and handle them if any

dataset.isnull().sum()
# there are no null values

# output
Day 0
Outlook 0
Temperature 0
Routine 0
Play 0
dtype: int64

Step 3: Data preprocessing

Machine learning models cannot work with string values, so we need to convert the strings to numeric values; that is why preprocessing is required. There are many ways to preprocess data; one of them is label encoding with sklearn's LabelEncoder.

from sklearn.preprocessing import LabelEncoder
# encode each feature column's strings as integers
x = dataset[['Outlook', 'Temperature', 'Routine']].apply(LabelEncoder().fit_transform)
x
Outlook   Temperature   Routine
1         0             0
1         1             1
0         1             0
1         1             0
0         0             0
0         0             1
1         0             1
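
Note that LabelEncoder assigns integer codes alphabetically within each column (Cloudy→0, Sunny→1; Cold→0, Warm→1; Indoor→0, Outdoor→1). If you need to recover these mappings later, one illustrative variant keeps each fitted encoder around:

from sklearn.preprocessing import LabelEncoder

# fit one encoder per column so the string-to-code mapping can be inspected later
encoders = {col: LabelEncoder().fit(dataset[col])
            for col in ['Outlook', 'Temperature', 'Routine']}
x = dataset[['Outlook', 'Temperature', 'Routine']].copy()
for col, le in encoders.items():
    x[col] = le.transform(dataset[col])

print(encoders['Outlook'].classes_)  # ['Cloudy' 'Sunny'] -> codes 0 and 1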

Step 4: Create a DecisionTreeClassifier model and train it

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(x, dataset.Play)

# Output

DecisionTreeClassifier()
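
To see which splits the classifier actually learned, scikit-learn's export_text is handy. A minimal sketch, assuming the model and feature columns from above:

from sklearn.tree import export_text
# print the learned decision rules as indented text
print(export_text(model, feature_names=['Outlook', 'Temperature', 'Routine']))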

Step 5: Predict using new data

import numpy as np
x_test = np.array([1, 0, 1])  # 1 -> Sunny, 0 -> Cold, 1 -> Outdoor, per the preprocessed table
model.predict([x_test])[0]

# Output

'Yes'
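
Recent scikit-learn versions warn when you fit on a DataFrame (which carries feature names) and then predict on a bare array. An illustrative variant that avoids the warning passes a DataFrame with the same columns:

import pandas as pd
# same sample as above, but with named columns matching the training data
x_test = pd.DataFrame([[1, 0, 1]], columns=['Outlook', 'Temperature', 'Routine'])
print(model.predict(x_test)[0])  # 'Yes'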
