Logistic Regression

Modeling Binary and Multi-Class Probabilities

What is Logistic Regression?

In statistics, the logistic model (or logit model) models the probability that an observation belongs to a particular class or that an event occurs, such as pass/fail, win/lose, alive/dead, or healthy/sick. This can be extended to several classes, such as determining whether an image contains a cat, dog, lion, etc. Each class is assigned a probability between 0 and 1, and the probabilities sum to one.

Despite its name containing “regression,” logistic regression is primarily a classification algorithm, not a regression algorithm in the traditional sense.

The Logistic Function (Sigmoid Function)

At the heart of logistic regression is the sigmoid function, which maps any real-valued number to a value between 0 and 1:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Where $z = w^T x + b$ is a linear combination of input features $x$, weights $w$, and bias $b$.

Visualization

The sigmoid function has an S-shaped curve:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 100)
plt.plot(z, sigmoid(z))
plt.xlabel('z')
plt.ylabel('σ(z)')
plt.title('Sigmoid Function')
plt.grid(True)
plt.show()

  • When $z \to \infty$, $\sigma(z) \to 1$
  • When $z \to -\infty$, $\sigma(z) \to 0$
  • When $z = 0$, $\sigma(z) = 0.5$
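
A quick numerical check of these limits, reusing np and sigmoid from the plotting snippet above:

# Near 0, exactly 0.5, and near 1, matching the limits above
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# [4.53978687e-05 5.00000000e-01 9.99954602e-01]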

Binary Logistic Regression

For binary classification (two classes: 0 and 1), logistic regression predicts the probability that an input belongs to class 1:

$$ P(y=1 | x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}} $$
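
As a small worked example, with weights, bias, and input values chosen purely for illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([1.5, -0.5])  # illustrative weights
b = -1.0                   # illustrative bias
x = np.array([2.0, 1.0])   # a single input example

z = np.dot(w, x) + b       # 1.5*2 - 0.5*1 - 1 = 1.0
print(sigmoid(z))          # ~0.731 = P(y=1 | x)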

Decision Boundary

We classify based on a threshold (typically 0.5):

  • If $P(y=1 | x) \geq 0.5$, predict class 1
  • If $P(y=1 | x) < 0.5$, predict class 0
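
Since $\sigma(z) \geq 0.5$ exactly when $z \geq 0$, thresholding the probability at 0.5 is the same as checking the sign of the linear term, so the decision boundary is the hyperplane

$$ w^T x + b = 0 $$

This is why logistic regression is a linear classifier.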

Cost Function

Logistic regression uses the log loss (binary cross-entropy) as its cost function:

$$ J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] $$

Where:

  • $m$ is the number of training examples
  • $y^{(i)}$ is the true label (0 or 1)
  • $\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$ is the predicted probability
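
A minimal NumPy sketch of this cost (the clipping constant eps is an implementation detail that keeps the logarithms finite; it is not part of the formula):

import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # Keep predictions away from exactly 0 and 1 so log() stays finite
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
print(log_loss(y_true, y_pred))  # ~0.409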

Training with Gradient Descent

We minimize the cost function using gradient descent:

$$ w := w - \alpha \frac{\partial J}{\partial w} $$

$$ b := b - \alpha \frac{\partial J}{\partial b} $$

Where $\alpha$ is the learning rate.
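
For this cost function the gradients take a simple form (the same expressions the from-scratch implementation below computes):

$$ \frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)} $$

$$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) $$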

Multi-Class Logistic Regression (Softmax Regression)

For multi-class classification (more than 2 classes), we use the softmax function:

$$ P(y=k | x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} $$

Where $K$ is the number of classes, and $z_k = w_k^T x + b_k$ for class $k$.

The softmax ensures all probabilities sum to 1:

$$ \sum_{k=1}^{K} P(y=k | x) = 1 $$

Example: Image Classification

For classifying images into cat, dog, or lion:

import numpy as np

# Example logits (raw model outputs)
logits = np.array([2.0, 1.0, 0.1])  # Cat, Dog, Lion

# Softmax function
def softmax(z):
    exp_z = np.exp(z - np.max(z))  # For numerical stability
    return exp_z / exp_z.sum()

probabilities = softmax(logits)
print("Cat:", probabilities[0])    # ~0.659
print("Dog:", probabilities[1])    # ~0.242
print("Lion:", probabilities[2])   # ~0.099

Implementation Example

Here’s a simple binary logistic regression from scratch:

import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.weights = None
        self.bias = None
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for _ in range(self.iterations):
            # Forward pass
            linear_pred = np.dot(X, self.weights) + self.bias
            predictions = self.sigmoid(linear_pred)
            
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (predictions - y))
            db = (1 / n_samples) * np.sum(predictions - y)
            
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
    
    def predict(self, X):
        linear_pred = np.dot(X, self.weights) + self.bias
        y_pred = self.sigmoid(linear_pred)
        return [1 if p >= 0.5 else 0 for p in y_pred]

# Example usage
X_train = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y_train = np.array([0, 0, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_train)
print(predictions)  # [0, 0, 1, 1]
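
The dw and db computed in fit are the vectorized form of the gradient expressions given earlier. On real datasets, standardizing the features first usually makes plain gradient descent converge much faster.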

Using Scikit-Learn

For practical applications, use scikit-learn:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

Advantages and Disadvantages

Advantages

  • Simple and interpretable
  • Works well for linearly separable data
  • Outputs probabilities, not just class labels
  • With regularization, less prone to overfitting

Disadvantages

  • Assumes linear relationship between features and log-odds
  • Doesn’t handle non-linear decision boundaries without feature engineering (see the sketch after this list)
  • Sensitive to outliers
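
As an illustrative sketch of that feature-engineering workaround (the dataset, noise level, and polynomial degree here are arbitrary choices):

from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric circles: not separable by any linear boundary
X, y = make_circles(noise=0.1, factor=0.5, random_state=42)

linear_model = LogisticRegression().fit(X, y)
poly_model = make_pipeline(
    PolynomialFeatures(degree=2), LogisticRegression()
).fit(X, y)

print("Raw features:       ", linear_model.score(X, y))  # near chance (~0.5)
print("Polynomial features:", poly_model.score(X, y))    # close to 1.0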

When to Use Logistic Regression

  • Binary or multi-class classification tasks
  • When you need probability estimates
  • As a baseline model before trying more complex algorithms
  • When interpretability is important

Key Takeaways

  • Logistic regression models probabilities using the sigmoid (binary) or softmax (multi-class) function.
  • It’s a linear classifier optimized using gradient descent.
  • Despite the name, it’s used for classification, not regression.
  • It forms the foundation for neural networks (a single-layer neural network with sigmoid activation is logistic regression).

For non-linear problems, consider using kernel methods, decision trees, or neural networks.