• Ekta Aggarwal

Grid Search in Python

Grid Search is used to find the best hyperparameters for a model. In this tutorial we will learn how to implement Grid Search in Python.


If you wish to understand the theory behind Grid Search then you can refer to this tutorial: Grid Search Explained.


We will understand Grid Search by tuning Random Forests on the iris dataset, which ships with the scikit-learn library. The iris dataset contains data for 150 flowers belonging to 3 different species: Setosa, Versicolor and Virginica. For each of these 150 flowers, the Sepal Length, Sepal Width, Petal Length and Petal Width are available.


Let us first load the pandas library:

import pandas as pd

Now we will load iris dataset from sklearn library

from sklearn.datasets import load_iris
iris = load_iris()

The following are the feature names in the iris dataset:

iris.feature_names

Output: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Now we are storing the independent variables from the iris dataset in X and the dependent variable in y:

X = iris.data
y = iris.target

We can see from the shape that X has 150 rows and 4 columns

X.shape

Output:

(150, 4)

We can see the number of occurrences of each species (coded as 0, 1 and 2):

pd.Series(y).value_counts()

Output:

2    50
1    50
0    50
dtype: int64
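
The integer codes are easier to read once they are mapped to species names. The following is a minimal sketch, assuming the `target_names` attribute of the object returned by `load_iris`; the variable names `species` and `species_counts` are ours and do not appear elsewhere in this tutorial.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# iris.target_names holds the species label for each integer code (0, 1, 2)
code_to_name = dict(enumerate(iris.target_names))
species = pd.Series(iris.target).map(code_to_name)

# Count the flowers per species name instead of per integer code
species_counts = species.value_counts()
print(species_counts)
```

Each of the three species should appear 50 times, matching the counts shown for the codes.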


Now we are splitting the data into a training set and a test set. We will build our model on the training set and use the test set to check the performance of the algorithm. We are splitting our data into an 80% training set and a 20% test set, so the training set has 120 rows and the test set has 30 rows.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Output:

(120, 4)
(30, 4)
(120,)
(30,)
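
A random split does not guarantee that each species is equally represented in both sets. As a quick check, the sketch below counts the classes on each side of the split. It repeats the split from above so that it runs on its own; `train_counts` and `test_counts` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# np.bincount counts how many times each class code (0, 1, 2) appears
train_counts = np.bincount(y_train, minlength=3)
test_counts = np.bincount(y_test, minlength=3)
print("train:", train_counts)
print("test: ", test_counts)
```

If the counts look lopsided for your data, `train_test_split` accepts a `stratify=y` argument that keeps the class proportions the same in both sets. The split used in this tutorial does not use it.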


Building Random Forests with default parameters


Let us build our Random Forest model with default parameters. With the following code we are importing the RandomForestClassifier class from the sklearn.ensemble module. For a regression problem we would have used RandomForestRegressor.

from sklearn.ensemble import RandomForestClassifier

Defining our Random Forests model with default parameters and fitting it on training set.

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

Making the predictions on test set:

pred_test = rf.predict(X_test)

We are now getting the accuracy for our test set. Since no random_state is set for this model, the value may vary slightly between runs.

from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred_test)

Output: 1.0



Grid Search code begins


A systematic way to find good parameters for our Random Forests model is Grid Search, which evaluates every combination of the parameter values we specify.

Here we are defining the dictionary for various parameters which we want to tune.

In scikit-learn, the following 5 parameters are commonly used for hyperparameter tuning of Random Forests:

  • n_estimators: Number of trees in the forest.

  • max_depth: The maximum depth of each tree.

  • min_samples_split: The minimum number of samples required to split a node.

  • min_samples_leaf: The minimum number of samples required to be at a leaf node. Let us say min_samples_leaf = 5; then a split is only considered if it leaves at least 5 observations in each of the child nodes, otherwise the parent node will not split.

  • max_features: The number of features to consider when looking for the best split at each node of a tree in the forest.
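
To see what these five parameters are set to when nothing is tuned, we can read them from an untrained model. This is a small sketch using `get_params()`; the list `tuned_names` is our own helper, and the default values printed depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# The five parameters listed above (helper list, not part of scikit-learn)
tuned_names = ["n_estimators", "max_depth", "min_samples_split",
               "min_samples_leaf", "max_features"]

# get_params() returns every constructor argument with its current value
all_params = RandomForestClassifier().get_params()
defaults = {name: all_params[name] for name in tuned_names}

for name, value in defaults.items():
    print(f"{name}: {value}")
```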


Currently we are only tuning n_estimators and max_depth:

Here we are trying the values 100, 250 and 500 for n_estimators, and 4, 8 and 10 for max_depth.

param_grid = {'n_estimators': [100, 250, 500], 'max_depth': [4, 8, 10]}
print(param_grid)

This will lead to 9 possible combinations of parameters:

  1. n_estimators = 100, max_depth = 4

  2. n_estimators = 100, max_depth = 8

  3. n_estimators = 100, max_depth = 10

  4. n_estimators = 250, max_depth = 4

  5. n_estimators = 250, max_depth = 8

  6. n_estimators = 250, max_depth = 10

  7. n_estimators = 500, max_depth = 4

  8. n_estimators = 500, max_depth = 8

  9. n_estimators = 500, max_depth = 10
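
The same list can be generated rather than written out by hand. The sketch below uses scikit-learn's `ParameterGrid` helper, which expands a parameter dictionary in the same way the grid search does; the variable name `combinations` is ours.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'n_estimators': [100, 250, 500], 'max_depth': [4, 8, 10]}

# ParameterGrid expands the dictionary into one dict per combination
combinations = list(ParameterGrid(param_grid))
print(len(combinations))

for combo in combinations:
    print(combo)
```

With 5-fold cross validation each of the 9 combinations is fitted 5 times, so the search performs 45 fits, plus one final fit of the best combination on the whole training set.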

Now we are defining the model to use. We set random_state = 5 so that the results are reproducible.

model_RF = RandomForestClassifier(random_state=5)

We are now defining the Grid Search for our model, specifying our parameter grid. cv=5 implies 5-fold cross validation, with 'accuracy' as the scoring mechanism. n_jobs=-1 runs the fits in parallel on all available CPU cores.

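Note that you can use other scoring mechanisms, such as 'f1', 'precision', 'recall' or 'roc_auc', for measuring the performance in K-fold cross validation for grid search. These four names are meant for binary targets; for a multiclass problem like iris, averaged variants such as 'f1_macro' or 'roc_auc_ovr' apply.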

from sklearn.model_selection import GridSearchCV
model2 = GridSearchCV(estimator=model_RF, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

We are now fitting the model to our training data to obtain the best parameters

model2.fit(X_train, y_train)

The best parameters of our model are given as:

model2.best_params_

Output: {'max_depth': 4, 'n_estimators': 100}


The best 5-fold cross validation accuracy for our model is:

model2.best_score_

Output: 0.95


The best model itself is available as best_estimator_. Note that n_estimators = 100 is the default in scikit-learn's RandomForestClassifier, which is why n_estimators does not appear in the output below.

model2.best_estimator_

Output: RandomForestClassifier(max_depth=4, random_state=5)


You can also look at the output of 5-fold cross validation for each of the 9 combinations using the following attribute:

model2.cv_results_

In the following output params refers to the different combinations of parameters tried.

split0_test_score refers to the score on 1st fold of cross validation for all the 9 parameter combinations.

split1_test_score refers to the score on 2nd fold of cross validation for all the 9 parameter combinations.

Similarly, split4_test_score refers to the score on 5th fold of cross validation for all the 9 parameter combinations.

mean_test_score gives us the average of the 5 fold scores for each of the 9 combinations.

std_test_score gives us the standard deviation of the 5 fold scores for each of the 9 combinations.

rank_test_score gives us the rank of each parameter combination on the basis of mean_test_score. The parameter combination with the highest mean_test_score is ranked 1.
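
Since cv_results_ is a dictionary of equal-length arrays, it is easier to read as a table. The sketch below repeats the grid search from above so that it runs on its own, then loads the results into a pandas DataFrame. The names `results`, `columns` and `summary` are ours, and the scores you see may differ slightly across scikit-learn versions.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {'n_estimators': [100, 250, 500], 'max_depth': [4, 8, 10]}
model2 = GridSearchCV(
    estimator=RandomForestClassifier(random_state=5),
    param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1,
)
model2.fit(X_train, y_train)

# One row per parameter combination, one column per key of cv_results_
results = pd.DataFrame(model2.cv_results_)

# Show the most useful columns, best combination first
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = results[columns].sort_values('rank_test_score')
print(summary.to_string(index=False))
```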


Fitting our model using Best Estimator

Now we are fitting our final Random Forests model using the best estimator found by Grid Search:

rf = model2.best_estimator_
rf.fit(X_train,y_train)
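
GridSearchCV has a refit parameter that defaults to True, which means the best estimator has already been fitted on the whole training set by the time the search finishes. Fitting it again as above is harmless, because random_state is fixed, but it is not required. The sketch below illustrates this with a smaller grid than the one in this tutorial, chosen here only to keep the example fast; `same_predictions` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Reduced grid (an assumption for speed, not the tutorial's grid)
param_grid = {'n_estimators': [50, 100], 'max_depth': [4, 8]}
model2 = GridSearchCV(
    estimator=RandomForestClassifier(random_state=5),
    param_grid=param_grid, cv=5, scoring='accuracy',
)
model2.fit(X_train, y_train)

# With refit=True, best_estimator_ is already fitted on all of X_train,
# and model2.predict simply delegates to it
pred_from_search = model2.predict(X_test)
pred_from_best = model2.best_estimator_.predict(X_test)
same_predictions = np.array_equal(pred_from_search, pred_from_best)
print(same_predictions)
```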

We are now importing the functions for the confusion matrix and accuracy, which we will use on our training and test sets.

from sklearn.metrics import confusion_matrix, accuracy_score

Making the predictions on training set:

y_pred_RF_train = rf.predict(X_train)

Now we are getting the accuracy for training set:

accuracy_score(y_train, y_pred_RF_train)

Output: 0.975


Making the predictions on test set:

y_pred_RF = rf.predict(X_test)

Now we are getting the accuracy for test set:

accuracy_score(y_test, y_pred_RF)

Output: 1.0
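
The confusion matrix shows which species are mistaken for which. The sketch below is self-contained: it rebuilds a Random Forest with the best parameters reported above (max_depth = 4, n_estimators = 100, random_state = 5) instead of reusing the fitted grid search. The names `cm_train` and `cm_test` are ours.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same settings as the best estimator reported by the grid search
rf = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=5)
rf.fit(X_train, y_train)

# Rows are the true classes, columns are the predicted classes
cm_train = confusion_matrix(y_train, rf.predict(X_train))
cm_test = confusion_matrix(y_test, rf.predict(X_test))

print("Training set:")
print(cm_train)
print("Test set:")
print(cm_test)
```

Entries on the diagonal are correct predictions, and any non-zero entry off the diagonal is a misclassification between two species.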