Logistic Regression in Python
In this tutorial we will learn how to implement the Logistic Regression algorithm in Python.
If you wish to understand the theory behind Logistic Regression then you can refer to this tutorial by us: Logistic Regression
To run Logistic Regression in Python we will use the breast cancer dataset, which is bundled with the scikit-learn library. The dataset comprises 30 variables for 569 patients.
Let us first load the pandas library:
import pandas as pd
Now we will load the breast cancer dataset from the sklearn library:
from sklearn.datasets import load_breast_cancer
breastcancer = load_breast_cancer()
The following are the variable names in the breast cancer dataset, available as breastcancer.feature_names:
array(['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension'], dtype='<U23')
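The call that produced the array above is not shown in the text. A minimal sketch, assuming it was the `feature_names` attribute of the object returned by `load_breast_cancer` (the variable `feature_names` is only an illustrative name), could look like this:

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset bundled with scikit-learn
breastcancer = load_breast_cancer()

# feature_names is a NumPy array of column names, one per variable
feature_names = breastcancer.feature_names
print(feature_names)
print(len(feature_names))
```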
Now we store the independent variables from the breast cancer dataset in X and the dependent variable in y:
X = breastcancer.data
y = breastcancer.target
The shape of X (X.shape) shows that it has 569 rows and 30 columns:
Output: (569, 30)
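As a small self-contained sketch of the shape check described above (assuming the check was done with the NumPy `shape` attribute), the following loads the data again and prints the shapes of both arrays:

```python
from sklearn.datasets import load_breast_cancer

breastcancer = load_breast_cancer()
X = breastcancer.data    # independent variables, one row per patient
y = breastcancer.target  # dependent variable, one label per patient

print(X.shape)
print(y.shape)
```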
Counting the values of y shows the number of cases in each class: 1 means benign and 0 means malignant. Our dataset has 357 benign and 212 malignant cases.
1    357
0    212
dtype: int64
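The code behind these counts is not shown. One way to obtain them, assuming pandas `value_counts` was used (the `dtype: int64` line in the output suggests a pandas Series), is sketched below. The name `class_counts` is an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

breastcancer = load_breast_cancer()
y = breastcancer.target

# Count how many patients fall into each class label
class_counts = pd.Series(y).value_counts()
print(class_counts)

# target_names maps label 0 and label 1 to their meaning
print(breastcancer.target_names)
```

Printing `target_names` is a quick way to confirm which label stands for which diagnosis.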
Now we split the data into a training set and a test set. We build the model on the training set and use the test set to check how well it performs on unseen data. We use an 80% / 20% split, which gives 455 rows in the training set and 114 rows in the test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(455, 30)
(114, 30)
(455,)
(114,)
Let us build our Logistic Regression model. For this we first import the class:
from sklearn.linear_model import LogisticRegression
Now we build the logistic regression model, specifying max_iter = 10000 so that the solver converges. By default max_iter = 100; with that default the solver did not converge on this data, so we increased max_iter to 10000.
We have set random_state = 123 for reproducibility.
model = LogisticRegression(random_state=123, max_iter=10000)
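To see the convergence behaviour described above for yourself, one option is to fit a second model with max_iter = 100 and record any warnings. This is only a sketch: the names `small_model`, `caught` and `hit_limit` are illustrative, and whether a `ConvergenceWarning` actually appears may depend on your scikit-learn version.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

breastcancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    breastcancer.data, breastcancer.target, test_size=0.2, random_state=42
)

# Record warnings instead of printing them, so we can inspect them afterwards
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    small_model = LogisticRegression(random_state=123, max_iter=100)
    small_model.fit(X_train, y_train)

# True if the solver stopped because it ran out of iterations
hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("ConvergenceWarning raised:", hit_limit)
print("Iterations used:", small_model.n_iter_)
```

The `n_iter_` attribute reports how many iterations the solver actually ran, which helps when choosing a larger max_iter.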
We now fit our logistic regression model on the training set.
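The fitting call itself is not shown in the text. A minimal end-to-end sketch, assuming the standard `fit` method of scikit-learn estimators and the same split and parameters as above, could be:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

breastcancer = load_breast_cancer()
X, y = breastcancer.data, breastcancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(random_state=123, max_iter=10000)

# fit() learns the intercept and coefficients from the training set only
model.fit(X_train, y_train)
print(model)
```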
Using classes_ we can see that our model has 2 classes: 0 and 1.
Output: array([0, 1])
The intercept for our logistic regression model is given by intercept_
The coefficients for each of the 30 variables in our logistic regression model are given by coef_
array([[ 1.04351229, 0.21108683, -0.33594961, 0.02371376, -0.16027709, -0.22667572, -0.53982807, -0.30046715, -0.22525567, -0.03306539, -0.09364145, 1.32118517, -0.18929123, -0.08491477, -0.02440484, 0.06529172, -0.02736973, -0.03374734, -0.02956454, 0.01431152, 0.19492332, -0.49300484, -0.02222415, -0.01737195, -0.3178944 , -0.7212673 , -1.42196559, -0.53118056, -0.74041148, -0.0917267 ]])
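The attributes mentioned above can be inspected as in the following sketch. Pairing each coefficient with its variable name is an addition for readability, not something the original output does, and the exact numbers you get may differ slightly from those shown above depending on your scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

breastcancer = load_breast_cancer()
X, y = breastcancer.data, breastcancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(random_state=123, max_iter=10000)
model.fit(X_train, y_train)

print(model.classes_)    # class labels seen during fitting
print(model.intercept_)  # one intercept for the binary problem

# coef_ has shape (1, 30): one coefficient per variable
for name, coef in zip(breastcancer.feature_names, model.coef_[0]):
    print(f"{name}: {coef:.4f}")
```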
Using the predict_proba function we can predict the probability of each class. The first probability in each row corresponds to class 0 and the second to class 1.
Below are the probabilities predicted by the logistic regression model for the first 5 rows of our test set:
array([[1.28395085e-01, 8.71604915e-01], [9.99999973e-01, 2.71915121e-08], [9.98299024e-01, 1.70097572e-03], [1.43442445e-03, 9.98565576e-01], [1.76887417e-04, 9.99823113e-01]])
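The code for this step is not shown. A sketch under the same setup as above, assuming the first 5 rows were selected by slicing `X_test`, might be:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

breastcancer = load_breast_cancer()
X, y = breastcancer.data, breastcancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(random_state=123, max_iter=10000)
model.fit(X_train, y_train)

# Each row holds [P(class 0), P(class 1)] for one patient
proba = model.predict_proba(X_test[:5])
print(proba)
```

The two probabilities in every row sum to 1, so knowing one of them is enough.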
Using the predict function we print the first 5 predictions for our test set. For a given row, if the predicted probability of class 1 is >= 0.5 then the class is predicted as 1, otherwise 0.
array([1, 0, 0, 1, 1])
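The 0.5 rule described above can be checked by applying the threshold by hand and comparing the result with `predict`. This is a sketch; `proba_class1` and `manual_pred` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

breastcancer = load_breast_cancer()
X, y = breastcancer.data, breastcancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(random_state=123, max_iter=10000)
model.fit(X_train, y_train)

# Probability of class 1 for every row in the test set
proba_class1 = model.predict_proba(X_test)[:, 1]

# Apply the 0.5 threshold manually
manual_pred = (proba_class1 >= 0.5).astype(int)

pred_test = model.predict(X_test)
print(pred_test[:5])
print(np.array_equal(manual_pred, pred_test))
```

Working from the probabilities directly is useful if you later want a threshold other than 0.5.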
Now we save the predictions on the training and test sets using the predict function:
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
Now we load the functions needed to get the confusion matrix and accuracy for our model:
from sklearn.metrics import confusion_matrix, accuracy_score
Now we create the confusion matrix for the training set. Rows are the actual classes and columns are the predicted classes, so the off-diagonal entries show 14 + 7 = 21 misclassifications by logistic regression.
array([[155, 14], [ 7, 279]], dtype=int64)
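The call that builds this matrix is not shown. A sketch, assuming `confusion_matrix(y_train, pred_train)` with actual labels first and predictions second (the names `cm_train` and `misclassified` are illustrative), could be:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

breastcancer = load_breast_cancer()
X, y = breastcancer.data, breastcancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(random_state=123, max_iter=10000)
model.fit(X_train, y_train)
pred_train = model.predict(X_train)

# Rows are actual classes, columns are predicted classes
cm_train = confusion_matrix(y_train, pred_train)
print(cm_train)

# Off-diagonal cells are the misclassified rows
misclassified = cm_train[0, 1] + cm_train[1, 0]
print("Misclassified:", misclassified)
```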
Using the accuracy_score function we get our training set accuracy. From the confusion matrix above this is (155 + 279) / 455 ≈ 0.954.
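A sketch of the accuracy calculation for the training set follows. It also computes the share of correct predictions by hand, to show that `accuracy_score` is simply that proportion. The names `acc_train` and `manual_acc` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

breastcancer = load_breast_cancer()
X, y = breastcancer.data, breastcancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(random_state=123, max_iter=10000)
model.fit(X_train, y_train)
pred_train = model.predict(X_train)

# Accuracy = correct predictions / all predictions
acc_train = accuracy_score(y_train, pred_train)
manual_acc = (pred_train == y_train).mean()
print(acc_train)
print(manual_acc)
```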
Now, we are getting the confusion matrix for test set
array([[39, 4], [ 1, 70]], dtype=int64)
With the following code, we are visualising the confusion matrix:
import matplotlib.pyplot as plt
# Assumption: cm holds the test set confusion matrix from the previous step
cm = confusion_matrix(y_test, pred_test)
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted (0)', 'Predicted (1)'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual (0)', 'Actual (1)'))
ax.set_ylim(1.5, -0.5)
# Write each count in the centre of its cell
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center',
                backgroundcolor='black', color='white', fontsize='x-large')
plt.show()
The test set accuracy for our logistic regression model is 0.956
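The test set figures can be reproduced with a sketch like the one below, assuming the same split and model parameters as above. The names `cm_test` and `acc_test` are illustrative, and the exact values may vary slightly across scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

breastcancer = load_breast_cancer()
X, y = breastcancer.data, breastcancer.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(random_state=123, max_iter=10000)
model.fit(X_train, y_train)
pred_test = model.predict(X_test)

# Confusion matrix and accuracy on data the model has not seen
cm_test = confusion_matrix(y_test, pred_test)
acc_test = accuracy_score(y_test, pred_test)
print(cm_test)
print(round(acc_test, 3))
```

Comparing this value with the training set accuracy gives a first indication of whether the model is overfitting.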