Ekta Aggarwal

# Decision Trees in Python

In this tutorial we will learn how to implement the Decision Tree algorithm in Python.

If you wish to understand the theory behind Decision Trees, you can refer to this tutorial: __Working of Decision Trees__

To run Decision Trees in Python we will use the **iris** dataset, which is built into scikit-learn. The iris dataset comprises data for 150 flowers belonging to 3 different species: Setosa, Versicolor and Virginica. For each of these 150 flowers, the Sepal Length, Sepal Width, Petal Length and Petal Width are recorded.

Let us first load the pandas library:

`import pandas as pd`

Now we will load the iris dataset from the sklearn library:

```
from sklearn.datasets import load_iris
iris = load_iris()
```

Following are the variable names in the iris dataset:

`iris.feature_names`

__Output:__

```
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
```

Now we store the independent variables from the iris dataset in X and the dependent variable in y:

```
X = iris.data
y = iris.target
```

We can see from the shape that X has 150 rows and 4 columns

`X.shape`

__Output:__

(150, 4)

We can see the number of occurrences of each species:

`pd.Series(y).value_counts()`

__Output:__

```
2    50
1    50
0    50
dtype: int64
```

Now we are splitting the data into a training set and a test set. Note that we will build our model using the training set and use the test set to check the performance of the algorithm. We split the data 80% training / 20% test, so the training set has 120 rows and the test set has 30 rows.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```

__Output:__

```
(120, 4)
(30, 4)
(120,)
(30,)
```
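A side note: a plain random split does not guarantee that all three species are equally represented in the training and test sets. As a sketch (self-contained, repeating the load and split from above), scikit-learn's `stratify` argument to `train_test_split` preserves the class proportions in both sets:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# stratify=y keeps the 50/50/50 class balance intact in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(np.bincount(y_train))  # 40 flowers of each species in the training set
print(np.bincount(y_test))   # 10 of each species in the test set
```

With 150 balanced samples this gives exactly 40 flowers per species for training and 10 per species for testing.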

Let us build our Decision Tree model with default parameters. For this we are loading our libraries:

```
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
```

Now we are building a model for decision tree classifier with default parameters:

`dt = DecisionTreeClassifier()`

We are fitting our decision tree on training set

`dt.fit(X_train,y_train)`

Making the predictions on training and test set:

```
pred_train = dt.predict(X_train)
pred_test = dt.predict(X_test)
```

Using the plot_tree function we can visualise what our decision tree looks like:

```
import matplotlib.pyplot as plt
plt.subplots(nrows = 1,ncols = 1,figsize = (14,7))
tree.plot_tree(dt);
```
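By default, `plot_tree` labels nodes with raw column indices. A more readable sketch (self-contained, refitting a tree on the full dataset rather than the training split used above) passes `feature_names` and `class_names` from the iris bunch and fills nodes by class:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line when working interactively
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
dt = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(14, 7))
# Each node now shows the split rule, sample counts and majority species,
# with the fill colour indicating the dominant class
annotations = tree.plot_tree(dt,
                             feature_names=iris.feature_names,
                             class_names=iris.target_names,
                             filled=True,
                             ax=ax)
```

`plot_tree` returns the list of node annotations it drew, which can be handy for further styling.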

We can also see how our decision tree is built, in text form, using the export_text function:

```
from sklearn.tree import export_text
r = export_text(dt, feature_names=iris['feature_names'])
print(r)
```

**Output:**

```
|--- petal width (cm) <= 0.80
| |--- class: 0
|--- petal width (cm) > 0.80
| |--- petal length (cm) <= 4.75
| | |--- petal width (cm) <= 1.65
| | | |--- class: 1
| | |--- petal width (cm) > 1.65
| | | |--- class: 2
| |--- petal length (cm) > 4.75
| | |--- petal width (cm) <= 1.75
| | | |--- petal length (cm) <= 4.95
| | | | |--- class: 1
| | | |--- petal length (cm) > 4.95
| | | | |--- petal width (cm) <= 1.55
| | | | | |--- class: 2
| | | | |--- petal width (cm) > 1.55
| | | | | |--- sepal length (cm) <= 6.95
| | | | | | |--- class: 1
| | | | | |--- sepal length (cm) > 6.95
| | | | | | |--- class: 2
| | |--- petal width (cm) > 1.75
| | | |--- petal length (cm) <= 4.85
| | | | |--- sepal length (cm) <= 5.95
| | | | | |--- class: 1
| | | | |--- sepal length (cm) > 5.95
| | | | | |--- class: 2
| | | |--- petal length (cm) > 4.85
| | | | |--- class: 2
```
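We can check one of the printed rules programmatically. In the sketch below (self-contained, refitting on the full dataset), a typical Setosa flower with petal width 0.2 cm satisfies the tree's first rule (petal width <= 0.80), so the model should predict class 0:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
dt = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# Sepal length, sepal width, petal length, petal width (cm).
# Petal width 0.2 <= 0.80, so the root rule sends this flower to class 0.
sample = [[5.1, 3.5, 1.4, 0.2]]
predicted = iris.target_names[dt.predict(sample)[0]]
print(predicted)  # setosa
```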

Now we can calculate our accuracy using accuracy_score function:

`from sklearn.metrics import accuracy_score`

Getting the accuracy for training set

`accuracy_score(y_train,pred_train)`

**Output:** 1.0

Getting the accuracy for test set

`accuracy_score(y_test,pred_test)`

**Output:** 1.0
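Accuracy alone does not show which species, if any, get confused with each other. A confusion matrix breaks the test predictions down per class; the sketch below is self-contained and repeats the split and fit from earlier:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

dt = DecisionTreeClassifier().fit(X_train, y_train)

# Rows are true species, columns are predicted species;
# off-diagonal counts are misclassifications
cm = confusion_matrix(y_test, dt.predict(X_test))
print(cm)
```

With 100% test accuracy all 30 counts sit on the diagonal; on harder datasets the off-diagonal cells show where the model errs.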

**Decision Trees Using Grid Search**

Earlier we had built our decision tree using default parameters, but there are 3 main parameters of a decision tree to be tuned:

- **max_depth:** The maximum depth of the tree.
- **min_samples_split:** The minimum number of samples required to split a node.
- **min_samples_leaf:** The minimum number of samples required to be at a leaf node. Say min_samples_leaf = 5; then if splitting a node would leave fewer than 5 observations in a child node, that node will not be split.
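To see what max_depth does in practice, the sketch below (self-contained, fitting on the full dataset for illustration) compares an unconstrained tree with one capped at depth 3:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# An unconstrained tree keeps splitting until every leaf is pure
full = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# Capping max_depth stops growth early, trading training fit for simplicity
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(
    iris.data, iris.target)

print(full.get_depth(), shallow.get_depth())
```

A shallower tree is easier to interpret and less prone to overfitting, which is exactly why these parameters are worth tuning.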

A systematic way to find the best parameters for a Decision Tree is to use grid search:

```
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
```

We are creating a parameter grid, i.e., mapping the parameter names to the values that should be searched. In this grid we are only tuning max_depth and min_samples_split. We are trying the values 3, 4, and 5 for max_depth and 5, 6 for min_samples_split.

```
param_grid = {'max_depth' : [3,4,5] ,
'min_samples_split' : [5,6]}
print(param_grid)
```

**Output:** {'max_depth': [3, 4, 5], 'min_samples_split': [5, 6]}

We are now defining the grid search for our model, specifying our parameter grid. cv=5 implies 5-fold cross-validation, with 'accuracy' as the scoring metric.

```
grid = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy')
```

We are now fitting the model to our training data to obtain the best parameters

`grid.fit(X_train, y_train)`

The best estimator of our model is given as:

`grid.best_estimator_`

**Output:** DecisionTreeClassifier(max_depth=4, min_samples_split=6)

The best 5-fold cross validation accuracy for our model is:

`grid.best_score_`

**Output:** 0.9416666666666668

The best parameters of our model are given as:

`grid.best_params_`

**Output:** {'max_depth': 4, 'min_samples_split': 6}
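Beyond the single best combination, `cv_results_` records the mean cross-validated accuracy for every combination tried. The sketch below is self-contained (it repeats the split and grid search from above) and tabulates all 6 combinations:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

param_grid = {'max_depth': [3, 4, 5], 'min_samples_split': [5, 6]}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5,
                    scoring='accuracy').fit(X_train, y_train)

# One row per combination: 3 depths x 2 split sizes = 6 rows,
# each with its mean accuracy across the 5 folds
results = pd.DataFrame(grid.cv_results_)[
    ['param_max_depth', 'param_min_samples_split', 'mean_test_score']]
print(results)
```

Inspecting the full table is useful when several combinations score almost identically, since a simpler tree may then be preferable to the nominal best.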

Now we are fitting our final Decision Tree model using the best estimator, i.e.,

```
dt = grid.best_estimator_
dt.fit(X_train,y_train)
```

We are now making the predictions on training and test set:

```
pred_train = dt.predict(X_train)
pred_test = dt.predict(X_test)
```

Accuracy for our test set is given by:

`accuracy_score(y_test,pred_test)`

**Output:** 1.0
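For a fuller per-species summary of the final model, scikit-learn's `classification_report` gives precision, recall and F1-score for each class. The sketch below is self-contained, refitting the tuned tree (max_depth=4, min_samples_split=6) on the same split used above:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# The tuned parameters found by the grid search
dt = DecisionTreeClassifier(max_depth=4, min_samples_split=6).fit(
    X_train, y_train)

# Precision, recall and F1-score per species on the test set
report = classification_report(y_test, dt.predict(X_test),
                               target_names=iris.target_names)
print(report)
```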

We are now plotting our decision tree:

```
import matplotlib.pyplot as plt
plt.subplots(nrows = 1,ncols = 1,figsize = (14,7))
tree.plot_tree(dt);
```