- Ekta Aggarwal

# Logistic Regression in Python

In this tutorial we will learn how to implement the Logistic Regression algorithm in Python.

If you wish to understand the theory behind Logistic Regression, you can refer to our tutorial: __Logistic Regression__

To run Logistic Regression in Python we will use the **breastcancer** dataset, which is bundled with the sklearn library. The Breast Cancer dataset comprises 30 variables for 569 patients.

Let us first load the pandas library:

`import pandas as pd`

Now we will load the breastcancer dataset from the sklearn library:

```
from sklearn.datasets import load_breast_cancer
breastcancer = load_breast_cancer()
```

Following are the variable names in the breastcancer dataset:

`breastcancer.feature_names`

__Output:__

```
array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error',
'fractal dimension error', 'worst radius', 'worst texture',
'worst perimeter', 'worst area', 'worst smoothness',
'worst compactness', 'worst concavity', 'worst concave points',
'worst symmetry', 'worst fractal dimension'], dtype='<U23')
```
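For easier inspection, the feature matrix can also be viewed as a pandas DataFrame. The following is only a minimal sketch: the names `df` and the `target` column are our own choices for illustration and are not part of the dataset object.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

breastcancer = load_breast_cancer()

# Wrap the NumPy feature matrix in a DataFrame, using the feature names as columns.
df = pd.DataFrame(breastcancer.data, columns=breastcancer.feature_names)

# Append the class label as an extra column (illustrative column name).
df["target"] = breastcancer.target

print(df.shape)
print(df.iloc[:, :4].head())
```

The rest of this tutorial keeps working with the plain NumPy arrays, so this step is optional.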

Now we store the independent variables from the breastcancer dataset in X and the dependent variable in y:

```
X = breastcancer.data
y = breastcancer.target
```

We can see from the shape that X has 569 rows and 30 columns

`X.shape`

**Output:** (569, 30)

We can see the number of breast cancer cases: 1 means benign and 0 means malignant. In our dataset we have 212 malignant breast cancer cases.

`pd.Series(y).value_counts()`

**Output:**

```
1 357
0 212
dtype: int64
```
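To double-check which code stands for which diagnosis, the dataset object exposes `target_names`. The sketch below maps the numeric codes to those names before counting; `code_to_name` and `named_counts` are illustrative variable names of our own.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

breastcancer = load_breast_cancer()

# target_names holds the label for each class code, indexed by the code itself.
code_to_name = {code: str(name) for code, name in enumerate(breastcancer.target_names)}

# Replace the 0/1 codes with their names, then count each class.
named_counts = pd.Series(breastcancer.target).map(code_to_name).value_counts()

print(code_to_name)
print(named_counts)
```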

Now we split the data into a training set and a test set. We will build our model using the training set and use the test set to check the performance of the algorithm. We split our data into an 80% training set and a 20% test set. We can see that the training set has 455 rows and the test set has 114 rows.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```

__Output:__

(455, 30) (114, 30) (455,) (114,)
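The row counts follow from the split fraction. As a sketch of the arithmetic, assuming (as we understand scikit-learn's behaviour) that the test share is rounded up and the remaining rows go to the training set:

```python
import math

n_samples = 569
test_size = 0.2

# 0.2 * 569 = 113.8, which is rounded up to a whole number of test rows.
n_test = math.ceil(test_size * n_samples)

# Everything that is not in the test set is used for training.
n_train = n_samples - n_test

print(n_train, n_test)  # 455 114
```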

Let us build our Logistic Regression model with mostly default parameters. For this we are loading our library:

`from sklearn.linear_model import LogisticRegression`

Now we are building a logistic regression model, specifying max_iter = 10000 so that the solver converges. By default max_iter = 100; when we ran the model with this default it did not converge, so max_iter was increased to 10000.

We have set random_state = 123 for reproducibility.

`model = LogisticRegression( random_state=123, max_iter = 10000)`
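As an optional aside on the convergence issue: a common alternative to raising max_iter is to standardise the features first, which typically lets the solver converge in far fewer iterations. This is not the approach used in this tutorial; the sketch below is only an illustration, and `scaled_model` is a hypothetical name of our own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardise each feature, then fit logistic regression with the default max_iter.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(random_state=123))
scaled_model.fit(X_train, y_train)

# n_iter_ reports how many solver iterations were actually used.
print(scaled_model.named_steps["logisticregression"].n_iter_)
print(scaled_model.score(X_test, y_test))
```

Note that coefficients from a model fitted this way refer to the standardised features, so they would not match the values shown in the rest of this tutorial.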

We are fitting our logistic regression model on the training set:

`model.fit(X_train,y_train)`

Using classes_ we can see that our model has 2 classes: 0 and 1.

`model.classes_`

**Output:** array([0, 1])

The intercept for our logistic regression model is given by intercept_

`model.intercept_`

**Output:** array([26.65648096])

The coefficients for each of the 30 variables in our logistic regression model are given by coef_

`model.coef_`

__Output:__

```
array([[ 1.04351229, 0.21108683, -0.33594961, 0.02371376, -0.16027709,
-0.22667572, -0.53982807, -0.30046715, -0.22525567, -0.03306539,
-0.09364145, 1.32118517, -0.18929123, -0.08491477, -0.02440484,
0.06529172, -0.02736973, -0.03374734, -0.02956454, 0.01431152,
0.19492332, -0.49300484, -0.02222415, -0.01737195, -0.3178944 ,
-0.7212673 , -1.42196559, -0.53118056, -0.74041148, -0.0917267 ]])
```

Using the **predict_proba** function we can predict the probability of each class. The first probability corresponds to class 0 and the second to class 1.

In the code below we fetch the predicted probabilities from the logistic regression model for the first 5 rows of our test set:

`model.predict_proba(X_test)[:5]`

__Output:__

```
array([[1.28395085e-01, 8.71604915e-01],
[9.99999973e-01, 2.71915121e-08],
[9.98299024e-01, 1.70097572e-03],
[1.43442445e-03, 9.98565576e-01],
[1.76887417e-04, 9.99823113e-01]])
```
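These probabilities come from the intercept and coefficients shown earlier: for a binary model, the probability of class 1 is the sigmoid of the intercept plus the weighted sum of the features. The sketch below refits the same model and reproduces predict_proba by hand; the names `z`, `p_class1` and `manual_proba` are our own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(random_state=123, max_iter=10000)
model.fit(X_train, y_train)

# Linear predictor: one score per row, intercept + sum(coefficient * feature).
z = X_test @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid maps each score to the probability of class 1.
p_class1 = 1.0 / (1.0 + np.exp(-z))

# Column 0 is the probability of class 0, column 1 the probability of class 1.
manual_proba = np.column_stack([1.0 - p_class1, p_class1])

# Compare the hand-computed probabilities with predict_proba.
print(np.allclose(manual_proba, model.predict_proba(X_test)))
```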

Using the **predict** function we print the first 5 predictions for our test set. For instance, if the probability of class 1 is >= 0.5 then the class is predicted as 1, otherwise 0.

`model.predict(X_test)[:5]`

__Output:__

`array([1, 0, 0, 1, 1])`
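The thresholding rule can be checked by hand on the five probability rows printed earlier. The second part of the sketch uses a stricter cut-off of 0.9, which is purely an illustrative value of our own and not something the model applies.

```python
import numpy as np

# Predicted probabilities for the first 5 test rows, copied from the output above.
proba = np.array([
    [1.28395085e-01, 8.71604915e-01],
    [9.99999973e-01, 2.71915121e-08],
    [9.98299024e-01, 1.70097572e-03],
    [1.43442445e-03, 9.98565576e-01],
    [1.76887417e-04, 9.99823113e-01],
])

# Class 1 when its probability reaches 0.5, otherwise class 0.
pred = (proba[:, 1] >= 0.5).astype(int)
print(pred)  # [1 0 0 1 1]

# A stricter, purely illustrative cut-off: require 0.9 before predicting class 1.
pred_strict = (proba[:, 1] >= 0.9).astype(int)
print(pred_strict)  # [0 0 0 1 1]
```

With the stricter cut-off, the first row (probability 0.87 for class 1) flips from class 1 to class 0.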

Now we are saving the predictions on the training and test sets using the **predict** function:

```
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
```

Now we are loading the functions to get the confusion matrix and accuracy for our model:

`from sklearn.metrics import confusion_matrix, accuracy_score`

Now we are creating the confusion matrix for the training set. We can see that logistic regression makes 14 + 7 = 21 misclassifications.

`confusion_matrix(y_train, pred_train)`

__Output:__

array([[155, 14], [ 7, 279]], dtype=int64)

Using accuracy_score function we are getting our training set accuracy

`accuracy_score(y_train,pred_train)`

**Output:** 0.9538461538461539

Now, we are getting the confusion matrix for test set

`confusion_matrix(y_test,pred_test)`

__Output:__

array([[39, 4], [ 1, 70]], dtype=int64)

With the following code, we are visualising the confusion matrix:

`cm = confusion_matrix(y_test, pred_test)`

```
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted (0)', 'Predicted (1)'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual (0)', 'Actual (1)'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', backgroundcolor='black', color='white', fontsize='x-large')
plt.show()
```

The test set accuracy for our logistic regression model is 0.956

`accuracy_score(y_test,pred_test)`

**Output:** 0.956140350877193
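Both accuracy figures can be derived directly from the confusion matrices shown earlier, since the correct predictions sit on the diagonal. The sketch below copies those matrices by hand; `accuracy_from_cm` is a small helper of our own, not a scikit-learn function.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: actual, columns: predicted).
cm_train = np.array([[155, 14], [7, 279]])
cm_test = np.array([[39, 4], [1, 70]])

def accuracy_from_cm(cm):
    # Correct predictions sit on the diagonal; divide by the total number of rows scored.
    return np.trace(cm) / cm.sum()

print(accuracy_from_cm(cm_train))  # 434 / 455
print(accuracy_from_cm(cm_test))   # 109 / 114

# Share of each actual class that was predicted correctly (per-class recall).
print(np.diag(cm_test) / cm_test.sum(axis=1))
```

On the test set this gives 39 of 43 malignant cases and 70 of 71 benign cases classified correctly.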