Logistic Regression with scikit-learn in Python: A Beginner’s Guide

Logistic regression is a popular machine learning algorithm, especially for binary classification. It’s simple to implement, interpretable, and often a strong baseline even when the problem looks complex. In this post, we’ll implement logistic regression with scikit-learn, a widely used Python library for machine learning.

What is Logistic Regression?

Logistic regression is a supervised learning algorithm primarily used for binary classification. It’s based on the logistic function (also called the sigmoid function), which outputs a probability between 0 and 1.

While it’s called "regression", logistic regression is used for classification tasks. Unlike linear regression, which predicts a continuous value, logistic regression predicts probabilities that a data point belongs to one of two classes (0 or 1).

The model uses the following equation to make predictions:

$$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

Where:

$p(x)$ is the probability that the output is 1 given the input feature $x$,
$\beta_0$ is the intercept,
$\beta_1$ is the coefficient for the feature $x$.
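To make the formula concrete, here is a minimal sketch that evaluates it with NumPy. The values of beta_0 and beta_1 are hypothetical, picked only for illustration; they are not fitted to any data.

```python
import numpy as np


def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical parameters, chosen only to illustrate the formula
beta_0 = -3.0  # intercept
beta_1 = 1.0   # coefficient for the single feature x

x = np.array([0.0, 3.0, 6.0])

# p(x) = 1 / (1 + exp(-(beta_0 + beta_1 * x)))
p = sigmoid(beta_0 + beta_1 * x)
print(p)
```

The probability is exactly 0.5 where beta_0 + beta_1 * x equals zero (here at x = 3). It moves toward 0 or 1 as x moves away from that point.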

Step-by-Step Implementation with Scikit-learn

Let’s walk through the process of implementing logistic regression in Python using scikit-learn.

1. Setting Up Your Environment

First, make sure you have scikit-learn installed. You can do this by running:

pip install scikit-learn

2. Importing the Necessary Libraries

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

Here, we import numpy for handling arrays, train_test_split to split the data into training and testing sets, LogisticRegression to perform the logistic regression, and accuracy_score and confusion_matrix for evaluating the model.

3. Creating a Simple Dataset

Let’s start with a simple dataset. For this example, we’ll manually create a small dataset:

X = np.array([[1, 2], [2, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

Here:

X is the array of input features.
y is the array of corresponding labels (0 or 1).

4. Splitting the Data

We split our dataset into training and testing sets so that we can evaluate the model’s performance:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

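This holds out roughly 33% of the data for testing and keeps the rest for training. With only six samples, that works out to four training points and two test points. Setting random_state makes the shuffle reproducible.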
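A quick way to see what the split produced is to print the shapes of the resulting arrays. The sketch below also shows the stratify argument of train_test_split, which is not used in this post but may help on small datasets because it keeps the class proportions similar in both sets. The variable names ending in _s are introduced here only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([[1, 2], [2, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Same split as in the text: check how many samples land in each set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print("train:", X_train.shape, "test:", X_test.shape)

# Stratified variant: both classes are represented in the test set
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)
print("train class counts:", np.bincount(y_train_s))
print("test class counts:", np.bincount(y_test_s))
```

Without stratification, a test set this small can end up holding samples from a single class, which makes the evaluation metrics hard to interpret.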

5. Training the Logistic Regression Model

Now, we create and train the logistic regression model:

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

In the above code, we initialize the LogisticRegression model with its default settings (which include L2 regularization with C=1.0) and fit it to the training data (X_train and y_train).
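After fitting, the learned parameters are available as attributes, which ties the model back to the equation from earlier. The sketch below is one way to check this, not the only one. It reads intercept_ (the intercept) and coef_ (one coefficient per feature, so two here), then recomputes the probabilities by hand and compares them with predict_proba.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.array([[1, 2], [2, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

print("Intercept:", logreg.intercept_)   # shape (1,)
print("Coefficients:", logreg.coef_)     # shape (1, n_features)

# Recompute p(x) by hand: sigmoid of intercept + weighted sum of features
z = X_train @ logreg.coef_.ravel() + logreg.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba is the probability of class 1
p_sklearn = logreg.predict_proba(X_train)[:, 1]
print("Manual and sklearn probabilities match:", np.allclose(p_manual, p_sklearn))
```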

6. Making Predictions

Once the model is trained, we can make predictions on the test data:

y_pred = logreg.predict(X_test)
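predict returns hard class labels, but the model also exposes the underlying probabilities through predict_proba. As a rough sketch of how the two relate, the code below checks that the predicted label is simply the class with the highest probability. The name manual_labels is introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.array([[1, 2], [2, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

logreg = LogisticRegression().fit(X_train, y_train)

# One row per test sample, one column per class (ordered as logreg.classes_)
proba = logreg.predict_proba(X_test)
print("Classes:", logreg.classes_)
print("Probabilities:\n", proba)

# Hard labels: the class whose probability is largest
y_pred = logreg.predict(X_test)
manual_labels = logreg.classes_[np.argmax(proba, axis=1)]
print("Labels agree:", np.array_equal(y_pred, manual_labels))
```

Looking at the probabilities is useful when you want to apply a threshold other than 0.5 or to see how confident the model is about each prediction.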

7. Evaluating the Model

Now it’s time to evaluate how well the model has performed. We can calculate the accuracy and display the confusion matrix.

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Confusion Matrix

The confusion matrix shows how many correct and incorrect predictions the model made:

conf_matrix = confusion_matrix(y_test, y_pred)
print(f'Confusion Matrix:\n{conf_matrix}')
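To show how the matrix is laid out, here is a small sketch on hypothetical labels (y_true and y_hat are made up for illustration and are not produced by the model above). Passing labels=[0, 1] is optional, but it guarantees a 2x2 matrix even if one class happens to be missing from the data being evaluated.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)

# For a binary problem the four cells unpack in this order
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Accuracy is the diagonal divided by the total
print("Accuracy from matrix:", (tn + tp) / cm.sum())
```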

8. Full Code

Here’s the full code for the logistic regression implementation:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Create dataset
X = np.array([[1, 2], [2, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Initialize the logistic regression model
logreg = LogisticRegression()

# Train the model on the training data
logreg.fit(X_train, y_train)

# Make predictions on the test data
y_pred = logreg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Display results
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')

Understanding the Output

Accuracy: This is the ratio of correct predictions to total predictions. Higher is generally better, but accuracy alone can be misleading when one class is much more common than the other.
Confusion Matrix: This breaks the predictions down into true negatives, false positives, false negatives, and true positives (rows are the true classes, columns are the predicted classes). It's especially helpful when working with imbalanced datasets.
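The point about imbalanced datasets can be illustrated with a deliberately extreme, made-up example: a "model" that always predicts class 0 on data where 9 out of 10 samples are class 0. The labels below are hypothetical. recall_score is a scikit-learn metric not used elsewhere in this post.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical imbalanced labels: nine negatives, one positive
y_true = np.array([0] * 9 + [1])

# A trivial predictor that always answers 0
y_hat = np.zeros(10, dtype=int)

acc = accuracy_score(y_true, y_hat)
rec = recall_score(y_true, y_hat, zero_division=0)  # share of positives found
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

print("Accuracy:", acc)
print("Recall for class 1:", rec)
print("Confusion matrix:\n", cm)
```

The accuracy looks high even though the single positive sample is missed. The confusion matrix makes the miss visible as a false negative.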

Example Output:

Accuracy: 1.0
Confusion Matrix:
[[2 0]
 [0 1]]

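This output tells us that the model classified every test point correctly, giving an accuracy of 100%; the zeros off the diagonal of the confusion matrix mean nothing was misclassified. Treat these numbers as illustrative. With six samples and test_size=0.33, the test set holds only two points, so the counts you get when running the code will sum to 2 rather than 3. If both points happen to belong to the same class, confusion_matrix returns a 1x1 matrix unless you pass labels=[0, 1]. A score computed on so few points also says little about how the model would do on new data.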
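Because a single tiny test set gives a noisy estimate, one option (not part of the walkthrough above) is cross-validation, which rotates the held-out portion through the whole dataset. The sketch below uses cross_val_score with cv=3, an arbitrary choice for this six-sample example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.array([[1, 2], [2, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# For classifiers, an integer cv uses stratified folds, so each of the
# three folds is tested on one sample from each class
scores = cross_val_score(LogisticRegression(), X, y, cv=3)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

On a dataset this small the result is still only a rough indication, but every sample gets used for testing exactly once.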

Conclusion

Logistic regression is an easy-to-use algorithm that often gives solid results on classification tasks. In this post, we walked through a simple implementation using scikit-learn. Although we used a tiny toy dataset, the same steps (split, fit, predict, evaluate) apply to larger, real-world datasets.

You can experiment further by trying different datasets, tweaking hyperparameters, or even using multi-class classification with logistic regression!
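As a starting point for those experiments, here is a hedged sketch that tries several of these ideas at once: a different dataset (the Iris dataset bundled with scikit-learn, which has three classes), a few values of the regularization hyperparameter C, and feature scaling via a pipeline. The particular C values, the 30% test size, and max_iter=1000 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Iris: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scores = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C means stronger regularization
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=C, max_iter=1000),
    )
    model.fit(X_train, y_train)
    scores[C] = model.score(X_test, y_test)  # mean accuracy on the test set
    print(f"C={C}: test accuracy = {scores[C]:.3f}")

# With three classes, predict_proba returns one column per class
print("Probability columns:", model.predict_proba(X_test).shape[1])
```

LogisticRegression handles more than two classes without any extra configuration, so the only visible change from the binary case is the shape of the probability output.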

Codeagles

{Where coding begins}