Logistic Regression

What is Logistic Regression?

In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing, such as pass/fail, win/lose, alive/dead, or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1, with the sum adding to one.

Despite its name containing “regression,” logistic regression is primarily a classification algorithm, not a regression algorithm in the traditional sense.

The Logistic Function (Sigmoid Function)

At the heart of logistic regression is the sigmoid function, which maps any real-valued number to a value between 0 and 1:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Where $z = w^T x + b$ is a linear combination of input features $x$, weights $w$, and bias $b$.

Visualization

The sigmoid function has an S-shaped curve:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 100)
plt.plot(z, sigmoid(z))
plt.xlabel('z')
plt.ylabel('σ(z)')
plt.title('Sigmoid Function')
plt.grid(True)
plt.show()

When $z \to \infty$, $\sigma(z) \to 1$
When $z \to -\infty$, $\sigma(z) \to 0$
When $z = 0$, $\sigma(z) = 0.5$

Binary Logistic Regression

For binary classification (two classes: 0 and 1), logistic regression predicts the probability that an input belongs to class 1:

$$ P(y=1 | x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}} $$
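
Inverting the sigmoid shows why the model counts as linear. As a short derivation, writing $p = P(y=1 | x)$ (a shorthand introduced here only for brevity), the expression above can be solved for the linear term:

$$ p = \frac{1}{1 + e^{-(w^T x + b)}} \;\Rightarrow\; \frac{p}{1 - p} = e^{w^T x + b} \;\Rightarrow\; \log\frac{p}{1 - p} = w^T x + b $$

The quantity $\log\frac{p}{1-p}$ is the log-odds (logit) of class 1, so each weight $w_j$ is the change in log-odds per unit change in feature $x_j$.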

Decision Boundary

We classify based on a threshold (typically 0.5):

If $P(y=1 | x) \geq 0.5$, predict class 1
If $P(y=1 | x) < 0.5$, predict class 0
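
Because $\sigma(0) = 0.5$, thresholding the probability at 0.5 is the same as checking the sign of $w^T x + b$, which is why the decision boundary is a straight line (or hyperplane) in feature space. The sketch below illustrates this; the weights, bias, and sample points are hypothetical values chosen only for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, chosen only for illustration
w = np.array([1.5, -2.0])
b = 0.5

X = np.array([
    [2.0, 1.0],   # z = 3.0 - 2.0 + 0.5 = 1.5
    [0.0, 1.0],   # z = 0.0 - 2.0 + 0.5 = -1.5
    [1.0, 1.0],   # z = 1.5 - 2.0 + 0.5 = 0.0, exactly on the boundary
])

z = X @ w + b
proba = sigmoid(z)

# Two equivalent ways to apply the 0.5 threshold
pred_from_proba = (proba >= 0.5).astype(int)
pred_from_score = (z >= 0).astype(int)

print(pred_from_proba)  # [1 0 1]
print(np.array_equal(pred_from_proba, pred_from_score))  # True
```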

Cost Function

Logistic regression uses the log loss (binary cross-entropy) as its cost function:

$$ J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] $$

Where:

$m$ is the number of training examples
$y^{(i)}$ is the true label (0 or 1)
$\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$ is the predicted probability
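
To make the formula concrete, here is a minimal NumPy sketch of the same cost. The function name `binary_cross_entropy` and the clipping constant `eps` are choices made for this example (clipping avoids `log(0)`), not part of the definition above.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean log loss over all examples, following J(w, b) above."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log() never receives exactly 0 or 1
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.9, 0.1]
confident_wrong = [0.1, 0.9, 0.1, 0.9]

loss_right = binary_cross_entropy(y, confident_right)
loss_wrong = binary_cross_entropy(y, confident_wrong)
print(f"{loss_right:.3f}")  # 0.105, i.e. -log(0.9)
print(f"{loss_wrong:.3f}")  # 2.303, i.e. -log(0.1)
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is the behavior that pushes the model toward calibrated probabilities.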

Training with Gradient Descent

We minimize the cost function using gradient descent:

$$ w := w - \alpha \frac{\partial J}{\partial w} $$

$$ b := b - \alpha \frac{\partial J}{\partial b} $$

Where $\alpha$ is the learning rate.
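
For the log loss defined in the Cost Function section, both partial derivatives have a simple closed form. As a sketch of the derivation, applying the chain rule with the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ makes the logarithm and sigmoid terms cancel, leaving only the prediction error:

$$ \frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) $$

The from-scratch implementation later in this article computes these two quantities as `dw` and `db`.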

Regularization: L1, L2, and ElasticNet

Logistic regression without regularization can overfit, especially with high-dimensional features. Regularization adds a penalty term to the cost function:

$$ J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] + \lambda \cdot R(w) $$

Regularization	Penalty $R(w)$	Effect	Best For
L1 (Lasso)	$\|w\|_1 = \sum_j |w_j|$	Sparse weights, feature selection	High-dimensional data with irrelevant features
L2 (Ridge)	$\|w\|_2^2 = \sum_j w_j^2$	Small but non-zero weights	Correlated features, default choice
ElasticNet	$\rho \|w\|_1 + (1-\rho) \|w\|_2^2$, where $\rho \in [0, 1]$ is the mixing weight (`l1_ratio` in scikit-learn)	Combination of both	When you need both sparsity and group selection

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate synthetic data with 100 features, only 5 informative
X, y = make_classification(
    n_samples=1000, n_features=100, n_informative=5,
    n_redundant=0, random_state=42
)

# Compare regularization types
models = {
    "L1 (Lasso)": LogisticRegression(penalty='l1', solver='saga', C=1.0, max_iter=1000),
    "L2 (Ridge)": LogisticRegression(penalty='l2', solver='lbfgs', C=1.0, max_iter=1000),
    "ElasticNet": LogisticRegression(penalty='elasticnet', solver='saga', C=1.0, l1_ratio=0.5, max_iter=1000),
}

for name, model in models.items():
    model.fit(X, y)
    n_nonzero = np.sum(model.coef_ != 0)
    print(f"{name}: {n_nonzero} non-zero coefficients (out of 100)")

The C parameter is the inverse of regularization strength ($C = 1/\lambda$). Smaller C values apply stronger regularization. Use cross-validation to find the optimal C.
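
One way to run that cross-validation is a grid search over a few candidate values of C. The sketch below is illustrative rather than a recommended configuration: the candidate grid, the dataset size, and the choice of log loss as the selection metric are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    n_redundant=0, random_state=42
)

# Scale inside the pipeline so each CV fold is scaled using only its own training part
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid of candidate values; smaller C means stronger regularization
param_grid = {"classifier__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_log_loss")
search.fit(X, y)

best_C = search.best_params_["classifier__C"]
print(f"Best C: {best_C}, mean CV log loss: {-search.best_score_:.3f}")
```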

Advanced Optimization Methods

Beyond gradient descent, scikit-learn supports several solvers:

Solver	Algorithm	Best For	Supports L1	Supports L2
`lbfgs`	Limited-memory BFGS (quasi-Newton)	Small to medium datasets	No	Yes
`liblinear`	Coordinate descent	Small datasets, binary	Yes	Yes
`saga`	Stochastic Average Gradient Augmented	Large datasets, all penalties	Yes	Yes
`newton-cg`	Newton Conjugate Gradient	Well-conditioned problems	No	Yes
`sag`	Stochastic Average Gradient	Large datasets (faster than saga)	No	Yes

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import time

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)

solvers = ['lbfgs', 'saga', 'newton-cg', 'sag']
for solver in solvers:
    start = time.time()
    model = LogisticRegression(solver=solver, max_iter=500, random_state=42)
    model.fit(X, y)
    elapsed = time.time() - start
    print(f"{solver:12s}: {elapsed:.3f}s, accuracy={model.score(X, y):.3f}")

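For most practical applications, lbfgs (scikit-learn's default solver) is a reasonable first choice. For large datasets (100K+ samples), saga provides the best balance of speed and flexibility, since it is the only solver in the table that supports L1, L2, and ElasticNet penalties.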

Multi-Class Logistic Regression (Softmax Regression)

For multi-class classification (more than 2 classes), we use the softmax function:

$$ P(y=k | x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} $$

Where $K$ is the number of classes, and $z_k = w_k^T x + b_k$ for class $k$.

The softmax ensures all probabilities sum to 1:

$$ \sum_{k=1}^{K} P(y=k | x) = 1 $$

Example: Image Classification

For classifying images into cat, dog, or lion:

import numpy as np

# Example logits (raw model outputs)
logits = np.array([2.0, 1.0, 0.1])  # Cat, Dog, Lion

# Softmax function
def softmax(z):
    exp_z = np.exp(z - np.max(z))  # For numerical stability
    return exp_z / exp_z.sum()

probabilities = softmax(logits)
print("Cat:", probabilities[0])    # ~0.659
print("Dog:", probabilities[1])    # ~0.242
print("Lion:", probabilities[2])   # ~0.099
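
Softmax and the sigmoid are closely related: with only two classes, the softmax probability of the first class equals the sigmoid of the difference between the two logits. The quick numerical check below uses hypothetical logit values `z1` and `z2`.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # Subtract the max for numerical stability
    return exp_z / exp_z.sum()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical logits for a two-class problem
z1, z2 = 1.2, -0.3

two_class = softmax(np.array([z1, z2]))
binary = sigmoid(z1 - z2)

print(np.isclose(two_class[0], binary))  # True
print(np.isclose(two_class.sum(), 1.0))  # True
```

This is why binary logistic regression needs only one weight vector: only the difference between the two class scores matters.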

Implementation Example

Here’s a simple binary logistic regression from scratch:

import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.weights = None
        self.bias = None
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for _ in range(self.iterations):
            # Forward pass
            linear_pred = np.dot(X, self.weights) + self.bias
            predictions = self.sigmoid(linear_pred)
            
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (predictions - y))
            db = (1 / n_samples) * np.sum(predictions - y)
            
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
    
    def predict(self, X):
        linear_pred = np.dot(X, self.weights) + self.bias
        y_pred = self.sigmoid(linear_pred)
        # Use >= so the rule matches the decision boundary defined earlier
        return [1 if p >= 0.5 else 0 for p in y_pred]

# Example usage
X_train = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y_train = np.array([0, 0, 1, 1])

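# A larger step size and more iterations than the defaults: these unscaled
# features converge slowly under plain gradient descent
model = LogisticRegression(learning_rate=0.1, iterations=5000)
model.fit(X_train, y_train)
predictions = model.predict(X_train)
print(predictions)  # expected: [0, 0, 1, 1] once the fit separates the two classes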

Using Scikit-Learn

For practical applications, use scikit-learn:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

Evaluation Metrics

Accuracy alone is insufficient, especially for imbalanced datasets. Use these comprehensive metrics:

Confusion Matrix and Derived Metrics

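import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, classification_report,
    roc_curve, roc_auc_score, precision_recall_curve,
    log_loss
)

# The iris model above has three classes, but the TN/FP/FN/TP layout and the
# ROC curve below are binary-only. Refit on a two-class version of the same
# split (virginica vs. the rest) so the rest of this section runs as written.
y_train = (y_train == 2).astype(int)
y_test = (y_test == 2).astype(int)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Generate predictions
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # Probability of the positive class

# Confusion matrix (rows are true labels, columns are predicted labels)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(f"TN: {cm[0,0]}  FP: {cm[0,1]}")
print(f"FN: {cm[1,0]}  TP: {cm[1,1]}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# ROC-AUC
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC: {auc_score:.3f}")

# Log loss (cross-entropy loss)
loss = log_loss(y_test, y_proba)
print(f"Log Loss: {loss:.3f}")

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()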

Metric Interpretation Guide

Metric	Range	Perfect Value	Interpretation
Accuracy	0-1	1.0	Overall correct predictions (misleading if imbalanced)
Precision	0-1	1.0	Of positive predictions, how many were correct
Recall (Sensitivity)	0-1	1.0	Of actual positives, how many were found
F1 Score	0-1	1.0	Harmonic mean of precision and recall
ROC-AUC	0-1	1.0	Probability a randomly chosen positive ranks higher than a randomly chosen negative (0.5 = random guessing)
Log Loss	0-∞	0	Negative log-likelihood (lower is better)
Matthews Corr. Coeff.	-1 to 1	1.0	Balanced measure for imbalanced classes
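
To see how the metrics in the table relate to the four confusion-matrix cells, the sketch below computes them by hand. The counts are hypothetical and deliberately imbalanced (5 positives out of 100) to show how accuracy can look good while the other metrics do not.

```python
import math

# Hypothetical confusion-matrix counts for an imbalanced problem
tn, fp, fn, tp = 90, 5, 3, 2

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"Accuracy:  {accuracy:.3f}")   # 0.920
print(f"Precision: {precision:.3f}")  # 0.286
print(f"Recall:    {recall:.3f}")     # 0.400
print(f"F1:        {f1:.3f}")         # 0.333
print(f"MCC:       {mcc:.3f}")        # 0.297
```

Accuracy is 92% even though the model finds only 2 of the 5 positives, which is the failure mode the precision, recall, F1, and MCC rows are meant to expose.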

Threshold Tuning

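The default threshold of 0.5 is not always optimal. Use precision-recall curves to find the best threshold. In practice, pick the threshold on a held-out validation set, because a threshold tuned on the test set makes the reported test metrics optimistic; the example reuses the test split only to keep the code short: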

def find_optimal_threshold(y_true, y_scores):
    """Find threshold that maximizes F1 score."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
    f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-10)
    optimal_idx = np.argmax(f1_scores[:-1])  # Exclude last element
    return thresholds[optimal_idx]

optimal_threshold = find_optimal_threshold(y_test, y_proba)
print(f"Optimal threshold: {optimal_threshold:.3f}")

# Apply custom threshold
y_pred_custom = (y_proba >= optimal_threshold).astype(int)
print(classification_report(y_test, y_pred_custom))

Feature Engineering for Logistic Regression

Logistic regression is a linear model — its performance depends heavily on feature quality. These techniques extend its reach to non-linear problems:

Polynomial Features

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

# Add interaction and polynomial terms
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X_train)
print(f"Original features: {X_train.shape[1]}")
print(f"With polynomial features (deg=2): {X_poly.shape[1]}")

# Pipeline with polynomial features
pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(C=1.0, max_iter=1000))
])
pipeline.fit(X_train, y_train)
print(f"Polynomial model accuracy: {pipeline.score(X_test, y_test):.3f}")

Feature Scaling

Logistic regression with L1/L2 regularization requires feature scaling. Features on different scales cause the regularization to penalize coefficients unevenly:

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline

# Standard scaling (z-score)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(C=1.0, max_iter=1000))
])
pipeline.fit(X_train, y_train)

Encoding Categorical Variables

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# Sample data with mixed types
data = pd.DataFrame({
    'age': [25, 35, 45, 55],
    'income': [50000, 75000, 100000, 120000],
    'education': ['high_school', 'bachelor', 'master', 'phd'],
    'city': ['NYC', 'SF', 'NYC', 'CHI'],
    'target': [0, 1, 1, 0]
})

# Preprocessing for mixed types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('edu', OrdinalEncoder(categories=[['high_school', 'bachelor', 'master', 'phd']]), ['education']),
        ('city', OneHotEncoder(drop='first'), ['city']),
    ]
)

pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('classifier', LogisticRegression(C=1.0))
])

X = data.drop('target', axis=1)
y = data['target']
pipeline.fit(X, y)

Interaction Terms

Logistic regression assumes additive feature effects. Add interaction terms to model feature dependencies:

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

# Add interaction features
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
# interaction_only=True creates only products (x1*x2), not squares (x1²)

pipeline = Pipeline([
    ('interactions', interactions),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(C=1.0))
])
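
To see exactly which columns `interaction_only` adds, the small check below transforms a single hypothetical row with two features and compares it against the full degree-2 expansion.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with features x1 = 2 and x2 = 3
X_demo = np.array([[2.0, 3.0]])

interactions_only = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
full_degree_2 = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

X_inter = interactions_only.fit_transform(X_demo)
X_full = full_degree_2.fit_transform(X_demo)

print(X_inter)  # [[2. 3. 6.]]        -> x1, x2, x1*x2
print(X_full)   # [[2. 3. 4. 6. 9.]]  -> x1, x2, x1^2, x1*x2, x2^2
```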

Handling Imbalanced Data

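Logistic regression does not require balanced classes, but with an unweighted loss and the default 0.5 threshold it tends to favor the majority class. For imbalanced datasets, use these strategies: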

Class Weight Adjustment

# Automatically adjust weights inversely proportional to class frequencies
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

# Manual weight specification
model = LogisticRegression(class_weight={0: 1.0, 1: 5.0}, max_iter=1000)
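
The scikit-learn documentation describes the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula on a hypothetical label vector with a 90/10 split to show the weights it yields.

```python
import numpy as np

# Hypothetical labels: 90 negatives and 10 positives
y_demo = np.array([0] * 90 + [1] * 10)

counts = np.bincount(y_demo)
n_samples = len(y_demo)
n_classes = len(counts)

# Rarer classes receive proportionally larger weights
weights = n_samples / (n_classes * counts)
print(weights)  # roughly 0.556 for class 0 and 5.0 for class 1
```

With these weights, each class contributes the same total weight to the loss (90 × 0.556 ≈ 10 × 5.0 = 50).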

Sampling Techniques

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline

# SMOTE oversampling
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"Before SMOTE: {np.bincount(y_train)}")
print(f"After SMOTE: {np.bincount(y_resampled)}")

# Combined pipeline with SMOTE
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('sampling', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(C=1.0, max_iter=1000))
])
pipeline.fit(X_train, y_train)

Threshold Moving for Imbalanced Data

import numpy as np
from sklearn.metrics import precision_recall_curve

def optimize_threshold_for_imbalanced(y_true, y_scores, min_recall=0.8):
    """Find the highest precision threshold while maintaining minimum recall."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
    # precisions and recalls have one more entry than thresholds; drop the
    # final (precision=1, recall=0) point so all three arrays line up
    precisions, recalls = precisions[:-1], recalls[:-1]
    valid = recalls >= min_recall
    if not valid.any():
        return 0.5
    best_idx = np.argmax(precisions[valid])
    return thresholds[valid][best_idx]

threshold = optimize_threshold_for_imbalanced(y_test, y_proba, min_recall=0.8)
y_pred_balanced = (y_proba >= threshold).astype(int)
print(f"Using threshold={threshold:.3f}")
print(classification_report(y_test, y_pred_balanced))

Real-World Case Study: Credit Risk Classification

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.pipeline import Pipeline

# Simulated credit risk dataset
np.random.seed(42)
n_samples = 10000

data = pd.DataFrame({
    'credit_score': np.random.normal(650, 100, n_samples),
    'income': np.random.lognormal(mean=11, sigma=0.5, size=n_samples),
    'debt_to_income': np.random.beta(2, 5, n_samples),
    'age': np.random.randint(22, 70, n_samples),
    'employment_length': np.random.randint(0, 40, n_samples),
    'loan_amount': np.random.lognormal(mean=10, sigma=1, size=n_samples),
})

# Simulate default probability (lower credit score + higher DTI = higher risk)
log_odds = (
    -0.01 * data['credit_score']
    + 0.00001 * data['income']
    + 2.0 * data['debt_to_income']
    + 0.001 * data['loan_amount']
    + np.random.normal(0, 1, n_samples)
)
data['default'] = (1 / (1 + np.exp(-log_odds)) > 0.5).astype(int)
print(f"Default rate: {data['default'].mean():.1%}")

# Build and evaluate model
X = data.drop('default', axis=1)
y = data['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(class_weight='balanced', C=0.1, max_iter=1000))
])

# Cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Cross-val AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# Final evaluation
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print(f"\nTest ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

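# Interpret coefficients. The features were standardized, so each odds ratio
# is the change in odds per one standard deviation of that feature.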
coef_df = pd.DataFrame({
    'feature': X.columns,
    'coefficient': pipeline.named_steps['classifier'].coef_[0]
})
coef_df['odds_ratio'] = np.exp(coef_df['coefficient'])
print("\nModel Coefficients (Odds Ratios):")
print(coef_df.sort_values('odds_ratio', ascending=False))
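
An odds ratio multiplies the odds, not the probability, so it helps to convert back when explaining results. The sketch below walks through that conversion; the coefficient and the baseline probability are hypothetical values, not outputs of the model above.

```python
import numpy as np

coefficient = 0.7                 # Hypothetical coefficient for a standardized feature
odds_ratio = np.exp(coefficient)  # About 2.01: the odds roughly double per +1 std dev

baseline_prob = 0.10              # Hypothetical default probability before the change
baseline_odds = baseline_prob / (1 - baseline_prob)

# Raising the feature by one standard deviation multiplies the odds by the odds ratio
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)

print(f"Odds ratio: {odds_ratio:.3f}")                            # 2.014
print(f"Probability: {baseline_prob:.3f} -> {new_prob:.3f}")      # 0.100 -> 0.183
```

Doubling the odds does not double the probability: here a 10% risk becomes about 18%, and the same odds ratio applied to a 50% baseline would give about 67%.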

Advantages and Disadvantages

Advantages

Simple and interpretable
Works well for linearly separable data
Outputs probabilities, not just class labels
Less prone to overfitting with regularization

Disadvantages

Assumes linear relationship between features and log-odds
Doesn’t work well with non-linear decision boundaries (without feature engineering)
Sensitive to outliers

When to Use Logistic Regression

Binary or multi-class classification tasks
When you need probability estimates
As a baseline model before trying more complex algorithms
When interpretability is important

Key Takeaways

Logistic regression models probabilities using the sigmoid (binary) or softmax (multi-class) function.
It’s a linear classifier, typically fit by gradient-based optimization such as gradient descent or L-BFGS.
Despite the name, it’s used for classification, not regression.
It forms the foundation for neural networks (a single-layer neural network with sigmoid activation is logistic regression).

For non-linear problems, consider using kernel methods, decision trees, or neural networks.
