Logistic Regression

Modeling Binary and Multi-Class Probabilities

Published: January 3, 2019 Updated: May 24, 2026 Larry Qu 10 min read

What is Logistic Regression?

In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing, such as pass/fail, win/lose, alive/dead, or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1, with the sum adding to one.

Despite its name containing “regression,” logistic regression is primarily a classification algorithm, not a regression algorithm in the traditional sense.

The Logistic Function (Sigmoid Function)

At the heart of logistic regression is the sigmoid function, which maps any real-valued number to a value between 0 and 1:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Where $z = w^T x + b$ is a linear combination of input features $x$, weights $w$, and bias $b$.

Visualization

The sigmoid function has an S-shaped curve:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 100)
plt.plot(z, sigmoid(z))
plt.xlabel('z')
plt.ylabel('σ(z)')
plt.title('Sigmoid Function')
plt.grid(True)
plt.show()

  • When $z \to \infty$, $\sigma(z) \to 1$
  • When $z \to -\infty$, $\sigma(z) \to 0$
  • When $z = 0$, $\sigma(z) = 0.5$
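
The textbook formula can overflow inside `np.exp` when $z$ is a large negative number. The following is a minimal sketch of a numerically stable variant; the name `stable_sigmoid` is made up for this illustration, and `scipy.special.expit` is a ready-made alternative if SciPy is available.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a positive number (hypothetical helper)."""
    z = np.atleast_1d(z).astype(float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so the textbook form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)); exp(z) is below 1.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid([-1000.0, 0.0, 1000.0])
print(values)  # the three limits listed above: 0, 0.5, 1
```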

Binary Logistic Regression

For binary classification (two classes: 0 and 1), logistic regression predicts the probability that an input belongs to class 1:

$$ P(y=1 | x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}} $$

Decision Boundary

We classify based on a threshold (typically 0.5):

  • If $P(y=1 | x) \geq 0.5$, predict class 1
  • If $P(y=1 | x) < 0.5$, predict class 0
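
Because $\sigma(z) \geq 0.5$ exactly when $z \geq 0$, thresholding the probability at 0.5 is the same as checking the sign of $w^T x + b$, so the decision boundary is the hyperplane $w^T x + b = 0$. A small sketch with made-up weights (the values of `w` and `b` are illustrative assumptions, not fitted parameters):

```python
import numpy as np

# Hypothetical parameters for a two-feature model (illustrative values only).
w = np.array([1.5, -2.0])
b = 0.5

X = np.array([
    [2.0, 1.0],   # z = 3.0 - 2.0 + 0.5 = 1.5
    [0.0, 1.0],   # z = 0.0 - 2.0 + 0.5 = -1.5
    [1.0, 1.0],   # z = 1.5 - 2.0 + 0.5 = 0.0, exactly on the boundary
])

z = X @ w + b
proba = 1 / (1 + np.exp(-z))

pred_from_proba = (proba >= 0.5).astype(int)  # threshold the probability
pred_from_sign = (z >= 0).astype(int)         # check the sign of the linear score

print(pred_from_proba)  # [1 0 1]
print(pred_from_sign)   # [1 0 1]
```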

Cost Function

Logistic regression uses the log loss (binary cross-entropy) as its cost function:

$$ J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] $$

Where:

  • $m$ is the number of training examples
  • $y^{(i)}$ is the true label (0 or 1)
  • $\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$ is the predicted probability
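
To make the formula concrete, here is a minimal NumPy sketch of the log loss. The function name `binary_cross_entropy` and the clipping constant `eps` are assumptions added for this example; clipping only guards against `log(0)`.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean log loss J for labels in {0, 1} and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log() never receives exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = [1, 0, 1, 0]
confident_right = binary_cross_entropy(y_true, [0.9, 0.1, 0.9, 0.1])
confident_wrong = binary_cross_entropy(y_true, [0.1, 0.9, 0.1, 0.9])

print(f"Confident and right: {confident_right:.3f}")  # -ln(0.9), about 0.105
print(f"Confident and wrong: {confident_wrong:.3f}")  # -ln(0.1), about 2.303
```

The loss grows quickly for confident mistakes, which is what pushes the model toward calibrated probabilities.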

Training with Gradient Descent

We minimize the cost function using gradient descent:

$$ w := w - \alpha \frac{\partial J}{\partial w} $$

$$ b := b - \alpha \frac{\partial J}{\partial b} $$

Where $\alpha$ is the learning rate.
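
For reference, here is a short derivation sketch of the gradients in these updates, using the log loss $J$ and $\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$ defined above. Since $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, the derivative of the per-example loss with respect to $z^{(i)} = w^T x^{(i)} + b$ simplifies to $\hat{y}^{(i)} - y^{(i)}$, and the chain rule gives:

$$ \frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)} $$

$$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) $$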

Regularization: L1, L2, and ElasticNet

Logistic regression without regularization can overfit, especially with high-dimensional features. Regularization adds a penalty term to the cost function:

$$ J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] + \lambda \cdot R(w) $$
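
| Regularization | Penalty $R(w)$ | Effect | Best For |
| --- | --- | --- | --- |
| L1 (Lasso) | $\lVert w \rVert_1 = \sum_j \lvert w_j \rvert$ | Sparse weights, feature selection | High-dimensional data with irrelevant features |
| L2 (Ridge) | $\lVert w \rVert_2^2 = \sum_j w_j^2$ | Small but non-zero weights | Correlated features, default choice |
| ElasticNet | $\rho \lVert w \rVert_1 + (1-\rho) \lVert w \rVert_2^2$ | Combination of both | When you need both sparsity and group selection |

Here $\rho \in [0, 1]$ is the L1 mixing ratio (`l1_ratio` in scikit-learn).
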
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate synthetic data with 100 features, only 5 informative
X, y = make_classification(
    n_samples=1000, n_features=100, n_informative=5,
    n_redundant=0, random_state=42
)

# Compare regularization types
models = {
    "L1 (Lasso)": LogisticRegression(penalty='l1', solver='saga', C=1.0, max_iter=1000),
    "L2 (Ridge)": LogisticRegression(penalty='l2', solver='lbfgs', C=1.0, max_iter=1000),
    "ElasticNet": LogisticRegression(penalty='elasticnet', solver='saga', C=1.0, l1_ratio=0.5, max_iter=1000),
}

for name, model in models.items():
    model.fit(X, y)
    n_nonzero = np.sum(model.coef_ != 0)
    print(f"{name}: {n_nonzero} non-zero coefficients (out of 100)")

The C parameter is the inverse of regularization strength ($C = 1/\lambda$). Smaller C values apply stronger regularization. Use cross-validation to find the optimal C.
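
One way to run that search is a cross-validated grid over C. The sketch below uses `GridSearchCV`; the synthetic dataset, the grid of six C values, and the log-loss scoring are illustrative assumptions rather than recommendations. scikit-learn also ships `LogisticRegressionCV`, which bundles the same idea into a single estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Scale inside the pipeline so each CV fold is scaled on its own training part
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names the step after the lowercased class name
param_grid = {"logisticregression__C": np.logspace(-3, 2, 6)}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_log_loss")
search.fit(X, y)

best_C = search.best_params_["logisticregression__C"]
print(f"Best C: {best_C:g}")
print(f"Mean CV log loss: {-search.best_score_:.3f}")
```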

Advanced Optimization Methods

Beyond gradient descent, scikit-learn supports several solvers:

| Solver | Algorithm | Best For | Supports L1 | Supports L2 |
| --- | --- | --- | --- | --- |
| lbfgs | Limited-memory BFGS (quasi-Newton) | Small to medium datasets | No | Yes |
| liblinear | Coordinate descent | Small datasets, binary | Yes | Yes |
| saga | Stochastic Average Gradient Augmented | Large datasets, all penalties | Yes | Yes |
| newton-cg | Newton Conjugate Gradient | Well-conditioned problems | No | Yes |
| sag | Stochastic Average Gradient | Large datasets with an L2 penalty | No | Yes |

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import time

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)

solvers = ['lbfgs', 'saga', 'newton-cg', 'sag']
for solver in solvers:
    start = time.time()
    model = LogisticRegression(solver=solver, max_iter=500, random_state=42)
    model.fit(X, y)
    elapsed = time.time() - start
    print(f"{solver:12s}: {elapsed:.3f}s, accuracy={model.score(X, y):.3f}")

For most practical applications, lbfgs is the default recommendation. For large datasets (100K+ samples), saga provides the best balance of speed and flexibility.

Multi-Class Logistic Regression (Softmax Regression)

For multi-class classification (more than 2 classes), we use the softmax function:

$$ P(y=k | x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} $$

Where $K$ is the number of classes, and $z_k = w_k^T x + b_k$ for class $k$.

The softmax ensures all probabilities sum to 1:

$$ \sum_{k=1}^{K} P(y=k | x) = 1 $$

Example: Image Classification

For classifying images into cat, dog, or lion:

import numpy as np

# Example logits (raw model outputs)
logits = np.array([2.0, 1.0, 0.1])  # Cat, Dog, Lion

# Softmax function
def softmax(z):
    exp_z = np.exp(z - np.max(z))  # For numerical stability
    return exp_z / exp_z.sum()

probabilities = softmax(logits)
print("Cat:", probabilities[0])    # ~0.659
print("Dog:", probabilities[1])    # ~0.242
print("Lion:", probabilities[2])   # ~0.099

Implementation Example

Here’s a simple binary logistic regression from scratch:

import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.weights = None
        self.bias = None
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for _ in range(self.iterations):
            # Forward pass
            linear_pred = np.dot(X, self.weights) + self.bias
            predictions = self.sigmoid(linear_pred)
            
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (predictions - y))
            db = (1 / n_samples) * np.sum(predictions - y)
            
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
    
    def predict(self, X):
        linear_pred = np.dot(X, self.weights) + self.bias
        y_pred = self.sigmoid(linear_pred)
        # Use >= so the rule matches the decision boundary defined earlier
        return [1 if p >= 0.5 else 0 for p in y_pred]

# Example usage
X_train = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y_train = np.array([0, 0, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_train)
print(predictions)  # target labels are [0, 0, 1, 1]; increase iterations if the fit has not converged

Using Scikit-Learn

For practical applications, use scikit-learn:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
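
For a multi-class problem such as iris, `predict_proba` returns one column per class and each row sums to 1, as in the softmax formula above. A small sketch (the variable names are chosen for this example, and it fits on the full dataset only to keep the snippet short):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

# One row per sample, one column per class
proba = clf.predict_proba(iris.data[:3])
print(proba.shape)         # (3, 3)
print(proba.sum(axis=1))   # each row sums to 1 (up to floating-point error)

# predict() corresponds to picking the most probable class
labels = clf.classes_[proba.argmax(axis=1)]
print(labels, clf.predict(iris.data[:3]))
```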

Evaluation Metrics

Accuracy alone is insufficient, especially for imbalanced datasets. Use these comprehensive metrics:

Confusion Matrix and Derived Metrics

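from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    confusion_matrix, classification_report,
    roc_curve, roc_auc_score, precision_recall_curve,
    log_loss
)
import matplotlib.pyplot as plt

# The metrics below assume a binary problem, so switch from the three-class
# iris data to a two-class dataset (an illustrative choice).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Generate predictions
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1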

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(f"TN: {cm[0,0]}  FP: {cm[0,1]}")
print(f"FN: {cm[1,0]}  TP: {cm[1,1]}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# ROC-AUC
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC: {auc_score:.3f}")

# Log loss (cross-entropy loss)
loss = log_loss(y_test, y_proba)
print(f"Log Loss: {loss:.3f}")

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

Metric Interpretation Guide

| Metric | Range | Perfect Value | Interpretation |
| --- | --- | --- | --- |
| Accuracy | 0 to 1 | 1.0 | Overall correct predictions (misleading if imbalanced) |
| Precision | 0 to 1 | 1.0 | Of positive predictions, how many were correct |
| Recall (Sensitivity) | 0 to 1 | 1.0 | Of actual positives, how many were found |
| F1 Score | 0 to 1 | 1.0 | Harmonic mean of precision and recall |
| ROC-AUC | 0 to 1 (0.5 = random) | 1.0 | Probability a positive ranks higher than a negative |
| Log Loss | 0 to ∞ | 0 | Negative log-likelihood (lower is better) |
| Matthews Corr. Coeff. | -1 to 1 | 1.0 | Balanced measure for imbalanced classes |
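
To show how these metrics relate, the sketch below computes them by hand from a set of confusion-matrix counts. The counts are hypothetical and chosen to mimic an imbalanced problem; `sklearn.metrics` provides equivalent functions for real predictions.

```python
import math

# Hypothetical counts: 60 actual positives, 940 actual negatives
tp, fp, fn, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"Accuracy:  {accuracy:.3f}")   # 0.970
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall:    {recall:.3f}")     # 0.667
print(f"F1:        {f1:.3f}")         # 0.727
print(f"MCC:       {mcc:.3f}")        # 0.715
```

Accuracy looks excellent at 0.97 even though a third of the actual positives are missed, which is why the other metrics matter on imbalanced data.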

Threshold Tuning

The default threshold of 0.5 is not always optimal. Use precision-recall curves to find the best threshold:

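import numpy as np
from sklearn.metrics import precision_recall_curve, classification_report

def find_optimal_threshold(y_true, y_scores):
    """Find threshold that maximizes F1 score."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
    f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-10)
    # precisions/recalls have one more entry than thresholds, so drop the last
    optimal_idx = np.argmax(f1_scores[:-1])
    return thresholds[optimal_idx]

# Tuning on the test set gives an optimistic estimate; in practice,
# choose the threshold on a separate validation split.
optimal_threshold = find_optimal_threshold(y_test, y_proba)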
print(f"Optimal threshold: {optimal_threshold:.3f}")

# Apply custom threshold
y_pred_custom = (y_proba >= optimal_threshold).astype(int)
print(classification_report(y_test, y_pred_custom))

Feature Engineering for Logistic Regression

Logistic regression is a linear model, so its performance depends heavily on feature quality. These techniques extend its reach to non-linear problems:

Polynomial Features

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

# Add interaction and polynomial terms
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X_train)
print(f"Original features: {X_train.shape[1]}")
print(f"With polynomial features (deg=2): {X_poly.shape[1]}")

# Pipeline with polynomial features
pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(C=1.0, max_iter=1000))
])
pipeline.fit(X_train, y_train)
print(f"Polynomial model accuracy: {pipeline.score(X_test, y_test):.3f}")

Feature Scaling

Logistic regression with L1/L2 regularization is sensitive to feature scale, so standardize features first. When features are on different scales, the penalty shrinks their coefficients unevenly, and solvers such as sag and saga converge more slowly:

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline

# Standard scaling (z-score)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(C=1.0, max_iter=1000))
])
pipeline.fit(X_train, y_train)

Encoding Categorical Variables

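import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler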

# Sample data with mixed types
data = pd.DataFrame({
    'age': [25, 35, 45, 55],
    'income': [50000, 75000, 100000, 120000],
    'education': ['high_school', 'bachelor', 'master', 'phd'],
    'city': ['NYC', 'SF', 'NYC', 'CHI'],
    'target': [0, 1, 1, 0]
})

# Preprocessing for mixed types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('edu', OrdinalEncoder(categories=[['high_school', 'bachelor', 'master', 'phd']]), ['education']),
        ('city', OneHotEncoder(drop='first'), ['city']),
    ]
)

pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', LogisticRegression(C=1.0))
])

X = data.drop('target', axis=1)
y = data['target']
pipeline.fit(X, y)

Interaction Terms

Logistic regression assumes feature effects are additive on the log-odds scale. Add interaction terms to model feature dependencies:

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

# Add interaction features
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
# interaction_only=True creates only products (x1*x2), not squares (x1²)

pipeline = Pipeline([
    ('interactions', interactions),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(C=1.0))
])
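
To see why interaction terms matter, here is a small sketch on synthetic XOR-like data, where the label depends only on the sign of the product of two features. The dataset, seed, and variable names are assumptions made for this illustration, and accuracy is measured on the training data only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
# XOR-like target: class 1 when x1 and x2 share the same sign
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Plain logistic regression sees only x1 and x2
plain = LogisticRegression().fit(X, y)

# The same model with the x1*x2 product added as a feature
with_interactions = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    LogisticRegression(),
).fit(X, y)

acc_plain = plain.score(X, y)
acc_inter = with_interactions.score(X, y)
print(f"Without interaction term: {acc_plain:.3f}")
print(f"With interaction term:    {acc_inter:.3f}")
```

No single straight line separates this data, so the plain model should stay near chance, while the product feature makes the classes separable by a linear rule.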

Handling Imbalanced Data

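Logistic regression does not require balanced classes, but on a heavily imbalanced dataset the fitted probabilities and the default 0.5 threshold tend to favor the majority class. Use these strategies: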

Class Weight Adjustment

# Automatically adjust weights inversely proportional to class frequencies
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

# Manual weight specification
model = LogisticRegression(class_weight={0: 1.0, 1: 5.0}, max_iter=1000)
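
For intuition, the `'balanced'` option weights each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that by hand on hypothetical labels and compares it with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 negatives and 10 positives
y = np.array([0] * 90 + [1] * 10)

# Manual version of the 'balanced' heuristic
n_classes = 2
manual = len(y) / (n_classes * np.bincount(y))

from_sklearn = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

print(manual)        # class 0 gets about 0.556, class 1 gets 5.0
print(from_sklearn)  # same values
```

Each mistake on the rare class then counts nine times as much as one on the common class, which is the 90:10 ratio inverted.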

Sampling Techniques

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline

# SMOTE oversampling
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"Before SMOTE: {np.bincount(y_train)}")
print(f"After SMOTE: {np.bincount(y_resampled)}")

# Combined pipeline with SMOTE
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('sampling', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(C=1.0, max_iter=1000))
])
pipeline.fit(X_train, y_train)

Threshold Moving for Imbalanced Data

import numpy as np
from sklearn.metrics import precision_recall_curve

def optimize_threshold_for_imbalanced(y_true, y_scores, min_recall=0.8):
    """Find the highest precision threshold while maintaining minimum recall."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
    # precisions and recalls have one more entry than thresholds; drop the
    # final (precision=1, recall=0) point so all three arrays line up.
    precisions, recalls = precisions[:-1], recalls[:-1]
    valid = recalls >= min_recall
    if not valid.any():
        return 0.5
    best_idx = np.argmax(precisions[valid])
    return thresholds[valid][best_idx]

threshold = optimize_threshold_for_imbalanced(y_test, y_proba, min_recall=0.8)
y_pred_balanced = (y_proba >= threshold).astype(int)
print(f"Using threshold={threshold:.3f}")
print(classification_report(y_test, y_pred_balanced))

Real-World Case Study: Credit Risk Classification

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.pipeline import Pipeline

# Simulated credit risk dataset
np.random.seed(42)
n_samples = 10000

data = pd.DataFrame({
    'credit_score': np.random.normal(650, 100, n_samples),
    'income': np.random.lognormal(mean=11, sigma=0.5, size=n_samples),
    'debt_to_income': np.random.beta(2, 5, n_samples),
    'age': np.random.randint(22, 70, n_samples),
    'employment_length': np.random.randint(0, 40, n_samples),
    'loan_amount': np.random.lognormal(mean=10, sigma=1, size=n_samples),
})

# Simulate default probability (lower credit score + higher DTI = higher risk)
log_odds = (
    -0.01 * data['credit_score']
    + 0.00001 * data['income']
    + 2.0 * data['debt_to_income']
    + 0.001 * data['loan_amount']
    + np.random.normal(0, 1, n_samples)
)
data['default'] = (1 / (1 + np.exp(-log_odds)) > 0.5).astype(int)
print(f"Default rate: {data['default'].mean():.1%}")

# Build and evaluate model
X = data.drop('default', axis=1)
y = data['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(class_weight='balanced', C=0.1, max_iter=1000))
])

# Cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Cross-val AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# Final evaluation
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print(f"\nTest ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

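# Interpret coefficients. Features were standardized, so each odds ratio
# describes a one-standard-deviation increase in that feature.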
coef_df = pd.DataFrame({
    'feature': X.columns,
    'coefficient': pipeline.named_steps['classifier'].coef_[0]
})
coef_df['odds_ratio'] = np.exp(coef_df['coefficient'])
print("\nModel Coefficients (Odds Ratios):")
print(coef_df.sort_values('odds_ratio', ascending=False))
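
An odds ratio multiplies the odds, not the probability, so the same coefficient moves the probability by different amounts depending on where it starts. A minimal sketch, with a hypothetical helper `shift_probability` and a made-up coefficient whose odds ratio is exactly 2:

```python
import numpy as np

def shift_probability(p, coefficient, delta=1.0):
    """Probability after one feature moves by `delta`, all other features fixed."""
    odds = p / (1 - p)
    new_odds = odds * np.exp(coefficient * delta)  # odds scale by exp(beta * delta)
    return new_odds / (1 + new_odds)

beta = np.log(2.0)  # hypothetical coefficient: odds ratio of 2

shifted = {p: shift_probability(p, beta) for p in (0.10, 0.50, 0.90)}
for p, new_p in shifted.items():
    print(f"{p:.2f} -> {new_p:.3f}")
# 0.10 -> 0.182
# 0.50 -> 0.667
# 0.90 -> 0.947
```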

Advantages and Disadvantages

Advantages

  • Simple and interpretable
  • Works well when classes are approximately linearly separable
  • Outputs probabilities, not just class labels
  • Less prone to overfitting with regularization

Disadvantages

  • Assumes linear relationship between features and log-odds
  • Doesn’t work well with non-linear decision boundaries (without feature engineering)
  • Sensitive to outliers

When to Use Logistic Regression

  • Binary or multi-class classification tasks
  • When you need probability estimates
  • As a baseline model before trying more complex algorithms
  • When interpretability is important

Key Takeaways

  • Logistic regression models probabilities using the sigmoid (binary) or softmax (multi-class) function.
  • It’s a linear classifier, typically fit with gradient descent or a quasi-Newton solver such as L-BFGS.
  • Despite the name, it’s used for classification, not regression.
  • It forms the foundation for neural networks (a single-layer neural network with sigmoid activation is logistic regression).

For non-linear problems, consider using kernel methods, decision trees, or neural networks.
