Logistic Regression

Modeling Binary and Multi-Class Probabilities

What is Logistic Regression?

In statistics, the logistic model (or logit model) models the probability that an observation belongs to a particular class or that an event occurs, such as pass/fail, win/lose, alive/dead, or healthy/sick. This can be extended to several classes, such as determining whether an image contains a cat, dog, lion, etc. Each class is assigned a probability between 0 and 1, and the probabilities sum to one.

Despite its name containing “regression,” logistic regression is primarily a classification algorithm, not a regression algorithm in the traditional sense.

The Logistic Function (Sigmoid Function)

At the heart of logistic regression is the sigmoid function, which maps any real-valued number to a value between 0 and 1:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Where $z = w^T x + b$ is a linear combination of input features $x$, weights $w$, and bias $b$.

Visualization

The sigmoid function has an S-shaped curve:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 100)
plt.plot(z, sigmoid(z))
plt.xlabel('z')
plt.ylabel('σ(z)')
plt.title('Sigmoid Function')
plt.grid(True)
plt.show()

  • When $z \to \infty$, $\sigma(z) \to 1$
  • When $z \to -\infty$, $\sigma(z) \to 0$
  • When $z = 0$, $\sigma(z) = 0.5$
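
A quick numerical check of these limits, reusing np and sigmoid from the plotting snippet above:

# Near 0, exactly 0.5, and near 1, matching the limits above
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# [4.53978687e-05 5.00000000e-01 9.99954602e-01]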

Binary Logistic Regression

For binary classification (two classes: 0 and 1), logistic regression predicts the probability that an input belongs to class 1:

$$ P(y=1 | x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}} $$
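
As a small worked example, with weights, bias, and input values chosen purely for illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([1.5, -0.5])  # illustrative weights
b = -1.0                   # illustrative bias
x = np.array([2.0, 1.0])   # a single input example

z = np.dot(w, x) + b       # 1.5*2 - 0.5*1 - 1 = 1.0
print(sigmoid(z))          # ~0.731 = P(y=1 | x)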

Decision Boundary

We classify based on a threshold (typically 0.5):

  • If $P(y=1 | x) \geq 0.5$, predict class 1
  • If $P(y=1 | x) < 0.5$, predict class 0
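
Since $\sigma(z) \geq 0.5$ exactly when $z \geq 0$, thresholding the probability at 0.5 is the same as checking the sign of the linear term, so the decision boundary is the hyperplane

$$ w^T x + b = 0 $$

This is why logistic regression is a linear classifier.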

Cost Function

Logistic regression uses the log loss (binary cross-entropy) as its cost function:

$$ J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] $$

Where:

  • $m$ is the number of training examples
  • $y^{(i)}$ is the true label (0 or 1)
  • $\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$ is the predicted probability
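
A minimal NumPy sketch of this cost (the clipping constant eps is an implementation detail that keeps the logarithms finite; it is not part of the formula):

import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # Keep predictions away from exactly 0 and 1 so log() stays finite
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
print(log_loss(y_true, y_pred))  # ~0.409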

Training with Gradient Descent

We minimize the cost function using gradient descent:

$$ w := w - \alpha \frac{\partial J}{\partial w} $$

$$ b := b - \alpha \frac{\partial J}{\partial b} $$

Where $\alpha$ is the learning rate.
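
For this cost function the gradients take a simple form (the same expressions the from-scratch implementation below computes):

$$ \frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)} $$

$$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) $$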

Multi-Class Logistic Regression (Softmax Regression)

For multi-class classification (more than 2 classes), we use the softmax function:

$$ P(y=k | x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} $$

Where $K$ is the number of classes, and $z_k = w_k^T x + b_k$ for class $k$.

The softmax ensures all probabilities sum to 1:

$$ \sum_{k=1}^{K} P(y=k | x) = 1 $$

Example: Image Classification

For classifying images into cat, dog, or lion:

import numpy as np

# Example logits (raw model outputs)
logits = np.array([2.0, 1.0, 0.1])  # Cat, Dog, Lion

# Softmax function
def softmax(z):
    exp_z = np.exp(z - np.max(z))  # For numerical stability
    return exp_z / exp_z.sum()

probabilities = softmax(logits)
print("Cat:", probabilities[0])    # ~0.659
print("Dog:", probabilities[1])    # ~0.242
print("Lion:", probabilities[2])   # ~0.099

Implementation Example

Here’s a simple binary logistic regression from scratch:

import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.weights = None
        self.bias = None
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for _ in range(self.iterations):
            # Forward pass
            linear_pred = np.dot(X, self.weights) + self.bias
            predictions = self.sigmoid(linear_pred)
            
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (predictions - y))
            db = (1 / n_samples) * np.sum(predictions - y)
            
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
    
    def predict(self, X):
        linear_pred = np.dot(X, self.weights) + self.bias
        y_pred = self.sigmoid(linear_pred)
        return [1 if p >= 0.5 else 0 for p in y_pred]

# Example usage
X_train = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y_train = np.array([0, 0, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_train)
print(predictions)  # [0, 0, 1, 1]
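
The dw and db computed in fit are the vectorized form of the gradient expressions given earlier. On real datasets, standardizing the features first usually makes plain gradient descent converge much faster.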

Using Scikit-Learn

For practical applications, use scikit-learn:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

Advantages and Disadvantages

Advantages

  • Simple and interpretable
  • Works well for linearly separable data
  • Outputs probabilities, not just class labels
  • With regularization, less prone to overfitting

Disadvantages

  • Assumes linear relationship between features and log-odds
  • Doesn’t handle non-linear decision boundaries without feature engineering (see the sketch after this list)
  • Sensitive to outliers
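
As an illustrative sketch of that feature-engineering workaround (the dataset, noise level, and polynomial degree here are arbitrary choices):

from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric circles: not separable by any linear boundary
X, y = make_circles(noise=0.1, factor=0.5, random_state=42)

linear_model = LogisticRegression().fit(X, y)
poly_model = make_pipeline(
    PolynomialFeatures(degree=2), LogisticRegression()
).fit(X, y)

print("Raw features:       ", linear_model.score(X, y))  # near chance (~0.5)
print("Polynomial features:", poly_model.score(X, y))    # close to 1.0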

When to Use Logistic Regression

  • Binary or multi-class classification tasks
  • When you need probability estimates
  • As a baseline model before trying more complex algorithms
  • When interpretability is important

Key Takeaways

  • Logistic regression models probabilities using the sigmoid (binary) or softmax (multi-class) function.
  • It’s a linear classifier optimized using gradient descent.
  • Despite the name, it’s used for classification, not regression.
  • It forms the foundation for neural networks (a single-layer neural network with sigmoid activation is logistic regression).

For non-linear problems, consider using kernel methods, decision trees, or neural networks.