• Ekta Aggarwal

Logistic Regression in Python

In this tutorial we will learn how to implement the Logistic Regression algorithm in Python.

If you wish to understand the theory behind Logistic Regression, you can refer to our tutorial: Logistic Regression

To run Logistic Regression in Python we will use the breast cancer dataset, which ships with the sklearn library. The dataset comprises 30 variables for 569 patients.

Let us first load the pandas library:

import pandas as pd

Now we will load the breast cancer dataset from the sklearn library:

from sklearn.datasets import load_breast_cancer
breastcancer = load_breast_cancer()

Following are the variable names in the breast cancer dataset:
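The names are stored in the feature_names attribute of the loaded dataset. A minimal sketch of this step, reloading the data so that the snippet runs on its own (the variable name feature_names is ours):

```python
from sklearn.datasets import load_breast_cancer

# Reload the dataset so that this snippet is self-contained
breastcancer = load_breast_cancer()

# feature_names is a NumPy array holding the names of the 30 variables
feature_names = breastcancer.feature_names
print(feature_names)
```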



array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

Now we are storing the independent variables from the breast cancer dataset in X and the dependent variable in y:

X = breastcancer.data
y = breastcancer.target

We can see from the shape that X has 569 rows and 30 columns
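A minimal sketch of the shape check, assuming X is built exactly as above:

```python
from sklearn.datasets import load_breast_cancer

breastcancer = load_breast_cancer()
X = breastcancer.data

# shape is a tuple: (number of rows, number of columns)
print(X.shape)
```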


Output: (569, 30)

We can see the number of cases in each class: 1 means benign and 0 means malignant. In our dataset we have 357 benign and 212 malignant breast cancer cases.
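One way to get these counts is to wrap y in a pandas Series and call value_counts. This is a sketch under that assumption, and the variable name class_counts is ours:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

y = load_breast_cancer().target

# value_counts tallies how many times each class label occurs
class_counts = pd.Series(y).value_counts()
print(class_counts)
```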



1    357
0    212
dtype: int64

Now we are splitting the data into a training set and a test set. We will build our model on the training set and use the test set to check the performance of the algorithm. We are using 80% of the data for training and 20% for testing, so the training set has 455 rows and the test set has 114 rows.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
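The sizes of the four pieces can be checked by printing their shapes. A minimal sketch, repeating the split so that it runs on its own (return_X_y=True is a shortcut that returns X and y directly):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Rows and columns of the training and test pieces
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```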





(455, 30) (114, 30) (455,) (114,)

Let us build our Logistic Regression model, keeping most parameters at their defaults. For this we are loading the class from sklearn:

from sklearn.linear_model import LogisticRegression

Now we are building a model for logistic regression by specifying max_iter = 10000 so that the solver converges. By default max_iter = 100. When we ran the model with this default it did not converge, so max_iter was increased to 10000.

We have set random_state = 123 for reproducibility.

model = LogisticRegression(random_state=123, max_iter=10000)

We are fitting our logistic regression model on the training set:
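A minimal sketch of the fitting step, repeating the setup from above so that it runs on its own:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data, split and model settings as in the tutorial
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(random_state=123, max_iter=10000)

# fit estimates the intercept and the coefficients from the training data
model.fit(X_train, y_train)
```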


Using classes_ we can see that our model has 2 classes: 0 and 1.
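A sketch of this lookup on a model fitted as above (the setup is repeated so that the snippet is self-contained):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(random_state=123, max_iter=10000).fit(X_train, y_train)

# classes_ lists the class labels the model saw during fitting
print(model.classes_)
```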


Output: array([0, 1])

The intercept for our logistic regression model is given by intercept_
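A sketch of reading the intercept from a model fitted as above. The exact value can differ slightly between sklearn versions, so treat the number shown in this tutorial as indicative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(random_state=123, max_iter=10000).fit(X_train, y_train)

# intercept_ is an array with a single value for a two-class problem
print(model.intercept_)
```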


Output: array([26.65648096])

The coefficients for each of the 30 variables in our logistic regression model are given by coef_
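A sketch of reading the coefficients from a model fitted as above. Pairing them with the variable names in a pandas Series is our own addition for readability, and the name coef_by_name is ours:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

breastcancer = load_breast_cancer()
X, y = breastcancer.data, breastcancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(random_state=123, max_iter=10000).fit(X_train, y_train)

# coef_ has shape (1, 30): one row of 30 coefficients for a two-class problem
print(model.coef_)

# Optional: label each coefficient with the name of its variable
coef_by_name = pd.Series(model.coef_[0], index=breastcancer.feature_names)
print(coef_by_name.head())
```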



array([[ 1.04351229,  0.21108683, -0.33594961,  0.02371376, -0.16027709,
        -0.22667572, -0.53982807, -0.30046715, -0.22525567, -0.03306539,
        -0.09364145,  1.32118517, -0.18929123, -0.08491477, -0.02440484,
         0.06529172, -0.02736973, -0.03374734, -0.02956454,  0.01431152,
         0.19492332, -0.49300484, -0.02222415, -0.01737195, -0.3178944 ,
        -0.7212673 , -1.42196559, -0.53118056, -0.74041148, -0.0917267 ]])

Using the predict_proba function we can predict the probability of each class. The first probability corresponds to class 0 and the second to class 1.

In the code below we have fetched the predicted probabilities by logistic regression model for first 5 rows in our test set:
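A minimal sketch of this call on a model fitted as above. As an extra check of our own, it also recomputes the class 1 probability by hand as the sigmoid of the intercept plus the weighted sum of the variables, which is how a two-class logistic regression arrives at these numbers:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(random_state=123, max_iter=10000).fit(X_train, y_train)

# One row per patient: column 0 is P(class 0), column 1 is P(class 1)
proba = model.predict_proba(X_test)
print(proba[:5])

# Manual check: sigmoid(intercept + sum of coefficient * variable)
z = X_test @ model.coef_[0] + model.intercept_[0]
manual_p1 = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(manual_p1, proba[:, 1]))
```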



array([[1.28395085e-01, 8.71604915e-01],
       [9.99999973e-01, 2.71915121e-08],
       [9.98299024e-01, 1.70097572e-03],
       [1.43442445e-03, 9.98565576e-01],
       [1.76887417e-04, 9.99823113e-01]])

Using the predict function we have printed the first 5 predictions for our test set. For instance, if the predicted probability of class 1 is greater than 0.5 then the class is predicted as 1, otherwise 0.
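A sketch of this call on a model fitted as above. The comparison against a 0.5 cut-off on the class 1 probability is our own check, and the names pred_first5 and by_threshold are ours:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(random_state=123, max_iter=10000).fit(X_train, y_train)

# Predicted class labels for the first 5 rows of the test set
pred_first5 = model.predict(X_test)[:5]
print(pred_first5)

# Applying the 0.5 cut-off to the class 1 probability by hand
by_threshold = (model.predict_proba(X_test)[:, 1] > 0.5).astype(int)
print((by_threshold == model.predict(X_test)).all())
```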



array([1, 0, 0, 1, 1])

Now we are saving the predictions on the training and test sets using the predict function:

pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

Now we are loading the functions to get the confusion matrix and accuracy for our model:

from sklearn.metrics import confusion_matrix, accuracy_score

Now we are creating our confusion matrix for the training set. Rows correspond to actual classes and columns to predicted classes, so we have 14 + 7 = 21 misclassifications by logistic regression:

confusion_matrix(y_train, pred_train)


array([[155,  14],
       [  7, 279]], dtype=int64)

Using accuracy_score function we are getting our training set accuracy
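A sketch of this step, with the setup repeated so that it runs on its own. The second line of the check, which counts the share of matching labels by hand, is our own addition (the name acc_train is ours):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(random_state=123, max_iter=10000).fit(X_train, y_train)
pred_train = model.predict(X_train)

# Accuracy is the share of training rows whose prediction matches the actual class
acc_train = accuracy_score(y_train, pred_train)
print(acc_train)
print((pred_train == y_train).mean())
```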


Output: 0.9538461538461539

Now, we are getting the confusion matrix for test set
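A sketch of this step, mirroring the training set call but using y_test and pred_test (the setup is repeated so that the snippet is self-contained, and the name cm_test is ours):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(random_state=123, max_iter=10000).fit(X_train, y_train)
pred_test = model.predict(X_test)

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
cm_test = confusion_matrix(y_test, pred_test)
print(cm_test)
```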



array([[39,  4],
       [ 1, 70]], dtype=int64)

With the following code, we are visualising the confusion matrix:

import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, pred_test)
fig, ax = plt.subplots(figsize=(8, 8))
# Draw the matrix as an image so that each count has a cell to sit in
ax.imshow(cm)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted (0)', 'Predicted (1)'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual (0)', 'Actual (1)'))
ax.set_ylim(1.5, -0.5)
# Write each count in the centre of its cell
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', backgroundcolor='black', color='white', fontsize='x-large')
plt.show()

The test set accuracy for our logistic regression model is 0.956
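With the test set confusion matrix shown earlier, this is (39 + 70) / 114, i.e. the correctly classified cases on the diagonal divided by all cases. A sketch of the call, with the same ratio computed from the confusion matrix as a check of our own (the name acc_test is ours):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(random_state=123, max_iter=10000).fit(X_train, y_train)
pred_test = model.predict(X_test)

# Share of test rows classified correctly
acc_test = accuracy_score(y_test, pred_test)
print(acc_test)

# Same number from the confusion matrix: diagonal total over grand total
cm = confusion_matrix(y_test, pred_test)
print(np.trace(cm) / cm.sum())
```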


Output: 0.956140350877193