Ekta Aggarwal

# Grid Search in Python

Grid Search is used to find the best hyperparameters for a model from a set of candidate values. In this tutorial we will see how to implement Grid Search in Python.

If you wish to understand the theory behind Grid Search then you can refer to this tutorial: __Grid Search Explained.__

We will understand Grid Search by tuning Random Forests on the **iris** dataset, which ships with the sklearn library. The iris dataset contains data for 150 flowers belonging to 3 different species: Setosa, Versicolor and Virginica. For each of these 150 flowers, the Sepal Length, Sepal Width, Petal Length and Petal Width are available.

Let us first load the pandas library:

`import pandas as pd`

Now we will load the iris dataset from the sklearn library:

```
from sklearn.datasets import load_iris
iris = load_iris()
```

The following are the variable names in the iris dataset:

`iris.feature_names`

__Output:__

['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']

Now we store the independent variables from the iris dataset in X and the dependent variable in y:

```
X = iris.data
y = iris.target
```

We can see from the shape that X has 150 rows and 4 columns

`X.shape`

__Output:__

(150, 4)
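If you prefer to inspect the data as a table, the arrays can be wrapped in a pandas DataFrame. The following is a minimal sketch; the names `df` and `species` are our own choices for illustration and are not used elsewhere in this tutorial.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Put the four measurements in a DataFrame with readable column names.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer labels (0, 1, 2) to species names via target_names.
df["species"] = iris.target_names[iris.target]

print(df.shape)
print(df.head())
```

The extra `species` column is only for readability; the model below is still trained on the numeric arrays X and y.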

We can see the number of occurrences of each species:

`pd.Series(y).value_counts()`

__Output:__

2    50
1    50
0    50
dtype: int64

Now we split the data into a training set and a test set. We will build our model on the training set and use the test set to check the performance of the algorithm. We use 80% of the data for training and 20% for testing, so the training set has 120 rows and the test set has 30 rows.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```

__Output:__

(120, 4)
(30, 4)
(120,)
(30,)
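The split above is purely random, so the three species are not guaranteed to appear in equal proportions in both parts. As an optional variation (not used in the rest of this tutorial), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both sets. The variable names `X_tr`, `X_te`, `y_tr` and `y_te` below are ours, chosen to avoid overwriting the split above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many rows of each class ended up in each part.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

Since iris has 50 flowers per species, a stratified 80/20 split should put 40 flowers of each species in the training part and 10 in the test part.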

**Building Random Forests with default parameters**

Let us build our Random Forest model with default parameters. With the following code we import the **RandomForestClassifier** class from the **ensemble** module of sklearn. For a regression problem we would have used **RandomForestRegressor**.

`from sklearn.ensemble import RandomForestClassifier`

Defining our Random Forests model with default parameters and fitting it on training set.

```
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
```

Making the predictions on test set:

`pred_test = rf.predict(X_test)`

We are now getting the accuracy for our test set.

```
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred_test)
```

**Output:** 1.0
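Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on a small made-up example; the helper `accuracy` and the toy labels are ours, for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Toy labels (made up for illustration): 4 of 5 predictions are correct.
toy_true = [0, 1, 2, 2, 1]
toy_pred = [0, 1, 2, 1, 1]
print(accuracy(toy_true, toy_pred))  # 0.8
```

With only 30 rows in our test set, each misclassified flower changes the accuracy by 1/30, so a single test score is a fairly coarse measure.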

**Grid Search code begins**

Grid Search is a systematic way to find good parameters for our Random Forests model: it tries every combination from a grid of candidate values and keeps the one with the best cross-validated score.

Here we are defining the dictionary for various parameters which we want to tune.

In sklearn, the following 5 parameters are commonly used for tuning Random Forests:

- **n_estimators:** Number of trees in the forest.
- **max_depth:** The maximum depth of the tree.
- **min_samples_split:** The minimum number of samples required to split a node.
- **min_samples_leaf:** The minimum number of samples required to be at a leaf node. Let us say min_samples_leaf = 5: a split is then only allowed if it leaves at least 5 observations in each child node, otherwise the parent node will not be split that way.
- **max_features:** The number of features to consider when looking for the best split in a tree of the forest.
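Tuning all five parameters at once gets expensive quickly, because a grid search tries every combination, and each combination is fitted once per cross-validation fold (5 folds are used later in this tutorial). The sketch below only does the arithmetic; the candidate values in `wide_grid` are hypothetical and were picked purely for illustration.

```python
import math

# Hypothetical candidate values, chosen only to illustrate the arithmetic.
wide_grid = {
    "n_estimators": [100, 250, 500],
    "max_depth": [4, 8, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", "log2"],
}

# Number of combinations = product of the number of values per parameter.
n_combinations = math.prod(len(values) for values in wide_grid.values())

# Each combination is fitted once per cross-validation fold.
cv_folds = 5
n_fits = n_combinations * cv_folds

print(n_combinations)  # 3 * 3 * 3 * 3 * 2 = 162
print(n_fits)          # 162 * 5 = 810
```

This is one reason to start with a small grid over the parameters that matter most.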

Currently we are only tuning *n_estimators* and *max_depth*:

Here we try the values 100, 250 and 500 for n_estimators, and 4, 8 and 10 for max_depth.

```
param_grid = {'n_estimators': [100, 250, 500], 'max_depth': [4, 8, 10]}
print(param_grid)
```

This will lead to 9 possible combinations of parameters:

n_estimators = 100, max_depth = 4

n_estimators = 100, max_depth = 8

n_estimators = 100, max_depth = 10

n_estimators = 250, max_depth = 4

n_estimators = 250, max_depth = 8

n_estimators = 250, max_depth = 10

n_estimators = 500, max_depth = 4

n_estimators = 500, max_depth = 8

n_estimators = 500, max_depth = 10
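The list above can also be generated programmatically. The following sketch uses `itertools.product` from the standard library to pair every n_estimators value with every max_depth value; the names `names` and `combinations` are ours. (sklearn offers a `ParameterGrid` class for the same purpose.)

```python
from itertools import product

param_grid = {'n_estimators': [100, 250, 500], 'max_depth': [4, 8, 10]}

# Pair every n_estimators value with every max_depth value.
names = list(param_grid)
combinations = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

for combo in combinations:
    print(combo)
print(len(combinations))
```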

Now we define the model to tune. Setting random_state = 5 makes the results reproducible.

`model_RF = RandomForestClassifier(random_state=5)`

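We are now defining the Grid Search for our model, specifying our parameter grid. cv = 5 implies 5-fold cross validation, with '*accuracy*' as the scoring mechanism.

__Note that you can use other scoring mechanisms, such as 'f1', 'precision', 'recall' or 'roc_auc', for measuring the performance in K-fold cross validation for grid search. For a multi-class target such as iris, use their averaged variants, for example 'f1_macro' or 'roc_auc_ovr'.__

GridSearchCV first needs to be imported from the **model_selection** module:

`from sklearn.model_selection import GridSearchCV`

`model2 = GridSearchCV(estimator=model_RF, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)`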
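As a sketch of a scorer other than accuracy on this three-class data, the block below evaluates a forest with 'f1_macro', which computes the F1 score per class and averages the three values. It uses `cross_val_score` to keep the example short; the same scoring strings can be passed to GridSearchCV. The forest size of 25 trees is an arbitrary choice made for speed.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# A small forest keeps the example fast; the size is arbitrary.
clf = RandomForestClassifier(n_estimators=25, random_state=5)

# 'f1_macro' computes F1 per class and averages the three values.
f1_scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
print(f1_scores)
print(f1_scores.mean())
```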

We are now fitting the model to our training data to obtain the best parameters

`model2.fit(X_train, y_train)`

The best parameters of our model are given as:

`model2.best_params_`

**Output:** {'max_depth': 4, 'n_estimators': 100}

The best 5-fold cross validation accuracy for our model is:

`model2.best_score_`

**Output:** 0.95
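This score is the mean of the five per-fold accuracies for the winning combination. As a rough cross-check, the sketch below evaluates a forest with the same settings using `cross_val_score`. It assumes the same train/test split and random_state as above; the name `best_like` is ours, and the exact numbers may differ slightly between library versions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The combination reported as best by the grid search above.
best_like = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=5)

# One accuracy value per fold; their mean is the cross-validated score.
fold_scores = cross_val_score(best_like, X_train, y_train, cv=5, scoring='accuracy')
print(fold_scores)
print(fold_scores.mean())
```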

We can also look at the best estimator itself. Note that n_estimators = 100 is the default in sklearn's RandomForestClassifier, so the output below does not show a value for n_estimators.

`model2.best_estimator_`

**Output:** RandomForestClassifier(max_depth=4, random_state=5)
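To see all parameter values, including the ones left at their defaults, you can call `get_params()` on an estimator. The sketch below builds a fresh classifier with the same settings as the output above; the names `est` and `params` are ours.

```python
from sklearn.ensemble import RandomForestClassifier

est = RandomForestClassifier(max_depth=4, random_state=5)

# get_params() returns every constructor parameter, including defaults.
params = est.get_params()
print(params['n_estimators'], params['max_depth'], params['random_state'])
```

In recent sklearn versions this shows n_estimators as 100 even though the printed estimator omits it.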

You can also look at the output of 5-fold cross validation for each of the 9 combinations using the following attribute:

`model2.cv_results_`
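The raw result is a dictionary of arrays, which is hard to read. A common approach is to load it into a pandas DataFrame, as in the sketch below. To keep it quick and self-contained, the sketch runs its own grid search with smaller forests than the ones in this tutorial; `small_grid`, `search`, `results` and `summary` are names we made up for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Smaller forests than in the tutorial, chosen only so the sketch runs quickly.
small_grid = {'n_estimators': [10, 25, 50], 'max_depth': [4, 8, 10]}
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=5),
    param_grid=small_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = results[columns].sort_values('rank_test_score')
print(summary)
```

The same two lines that build `results` and `summary` can be applied to `model2.cv_results_` after fitting `model2`.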

In this output, **params** refers to the different combinations of parameters tried.

**split0_test_score** refers to the score on the 1st fold of cross validation for all the 9 parameter combinations.

**split1_test_score** refers to the score on the 2nd fold of cross validation for all the 9 parameter combinations.

Similarly, **split4_test_score** refers to the score on the 5th fold of cross validation for all the 9 parameter combinations.

**mean_test_score** gives us the average of the scores of 5-fold cross validation for all 9 combinations.

**std_test_score** gives us the standard deviation of the scores of 5-fold cross validation for all 9 combinations.

**rank_test_score** gives us the rank of each parameter combination on the basis of *mean_test_score*. The parameter combination with the highest mean_test_score is ranked as 1.

**Fitting our model using Best Estimator**

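Now we are fitting our final Random Forests model using our best estimator. With the default setting refit=True, GridSearchCV has already refitted the best estimator on the whole training set, so calling fit again is not strictly necessary, but it does no harm.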

```
rf = model2.best_estimator_
rf.fit(X_train,y_train)
```

We are now getting the confusion matrix and accuracy for our training and test sets.

`from sklearn.metrics import confusion_matrix,accuracy_score`

**Making the predictions on training set:**

`y_pred_RF_train = rf.predict(X_train)`

Now we are getting the accuracy for training set:

`accuracy_score(y_train, y_pred_RF_train)`

**Output:** 0.975

**Making the predictions on test set:**

`y_pred_RF = rf.predict(X_test)`

Now we are getting the accuracy for test set:

`accuracy_score(y_test, y_pred_RF)`

**Output:** 1.0
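The confusion matrix imported above can be computed in the same way. The block below is a self-contained sketch that rebuilds the split and a forest with the best parameters found earlier; the names `rf_best` and `cm_test` are ours, and passing `labels=[0, 1, 2]` fixes the row and column order to the three species.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same settings as the best estimator reported above.
rf_best = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=5)
rf_best.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes (0, 1, 2).
cm_test = confusion_matrix(y_test, rf_best.predict(X_test), labels=[0, 1, 2])
print(cm_test)
```

The diagonal entries count correctly classified flowers, and any off-diagonal entry shows which species was mistaken for which.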