Scikit-learn Fundamentals: Classification, Regression, and Clustering
Scikit-learn is the go-to machine learning library for Python practitioners. Its clean API, comprehensive documentation, and rich collection of algorithms make it the foundation for countless machine learning projects. Whether you’re building your first model or optimizing a production system, scikit-learn provides the tools you need.
At its core, scikit-learn enables three fundamental machine learning approaches: Classification, Regression, and Clustering. Understanding when and how to use each approach is essential for solving real-world problems effectively.
Classification: Predicting Categories
What is Classification?
Classification is a supervised learning task where you predict which category or class an observation belongs to. The model learns from labeled examples where the correct category is known, then applies that knowledge to predict categories for new, unseen data.
When to Use Classification
Use classification when your target variable is categorical (discrete categories):
- Email spam detection (spam vs. not spam)
- Disease diagnosis (disease present vs. absent)
- Customer churn prediction (will leave vs. will stay)
- Image recognition (cat, dog, bird, etc.)
- Credit approval (approve vs. deny)
Common Classification Algorithms in Scikit-learn
- Logistic Regression: Fast, interpretable, good baseline
- Decision Trees: Interpretable, handles non-linear relationships
- Random Forests: Ensemble method, robust, handles complex patterns
- Support Vector Machines (SVM): Powerful for high-dimensional data
- Naive Bayes: Fast, works well with text data
- K-Nearest Neighbors (KNN): Simple lazy learner; training just stores the data
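All of these classifiers share scikit-learn's uniform estimator API (fit, predict, score), so swapping algorithms is a one-line change. A minimal sketch on the built-in iris data, comparing three of the algorithms above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Every classifier exposes the same fit/predict/score interface
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.2%}")
```

Because the interface is identical, you can benchmark several algorithms this way before committing to one.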
Classification Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load sample data
iris = load_iris()
X = iris.data # Features (flower measurements)
y = iris.target # Labels (flower species: 0, 1, or 2)
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Create and train the classifier
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(X_train, y_train)
# Make predictions
y_pred = classifier.predict(X_test)
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Key Classification Metrics
- Accuracy: Percentage of correct predictions (use when classes are balanced)
- Precision: Of positive predictions, how many were correct? (important when false positives are costly)
- Recall: Of actual positives, how many did we identify? (important when false negatives are costly)
- F1-Score: Balanced measure combining precision and recall
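All four metrics are available in sklearn.metrics. A small illustration on made-up binary labels (the arrays below are hypothetical, chosen so the counts are easy to check by hand):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up binary labels: 1 = positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Confusion counts: TP=3, FN=1, FP=2, TN=4
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # (3+4)/10 = 0.70
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 3/(3+2) = 0.60
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 3/(3+1) = 0.75
print(f"F1:        {f1_score(y_true, y_pred):.2f}")         # 2*0.60*0.75/1.35 = 0.67
```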
Regression: Predicting Continuous Values
What is Regression?
Regression is a supervised learning task where you predict a continuous numerical value. The model learns the relationship between input features and a continuous target variable, then uses that relationship to predict values for new data.
When to Use Regression
Use regression when your target variable is continuous (numerical values):
- House price prediction based on features
- Stock price forecasting
- Temperature prediction
- Customer lifetime value estimation
- Sales forecasting
- Demand prediction
Common Regression Algorithms in Scikit-learn
- Linear Regression: Simple, interpretable, good baseline
- Ridge/Lasso Regression: Linear regression with regularization
- Polynomial Regression: Captures non-linear relationships (PolynomialFeatures combined with a linear model)
- Support Vector Regression (SVR): Powerful for complex relationships
- Random Forest Regression: Ensemble method, handles non-linearity
- Gradient Boosting Regression: Often produces best results
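Ridge and Lasso are plain linear regression plus a penalty on coefficient size (L2 and L1 respectively); the L1 penalty can drive coefficients exactly to zero, effectively performing feature selection. A minimal sketch on the diabetes data, with alpha=1.0 chosen arbitrarily for illustration:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = load_diabetes(return_X_y=True)

# alpha is arbitrary here; tune it in practice (e.g. with RidgeCV/LassoCV)
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty zeroes some coefficients out

print("OLS   nonzero coefficients:", int(np.sum(ols.coef_ != 0)))
print("Ridge nonzero coefficients:", int(np.sum(ridge.coef_ != 0)))
print("Lasso nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
```

On this dataset Lasso keeps only a handful of features, while Ridge keeps all of them but with smaller magnitudes.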
Regression Example
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Load sample data
diabetes = load_diabetes()
X = diabetes.data # Features (patient measurements)
y = diabetes.target # Target (disease progression)
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Create and train the regressor
regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)
# Make predictions
y_pred = regressor.predict(X_test)
# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")
Key Regression Metrics
- Mean Squared Error (MSE): Average squared difference between predicted and actual values
- Root Mean Squared Error (RMSE): Square root of MSE, in same units as target
- Mean Absolute Error (MAE): Average absolute difference between predicted and actual values
- R² Score: Proportion of variance explained (1.0 is perfect, 0.0 matches always predicting the mean, and worse models can go negative)
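These metrics can be checked by hand on a tiny example. The arrays below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up predictions: errors are -0.5, 0.0, +0.5, +1.0
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

mse = mean_squared_error(y_true, y_pred)   # mean of [0.25, 0, 0.25, 1] = 0.375
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)  # mean of [0.5, 0, 0.5, 1] = 0.5
r2 = r2_score(y_true, y_pred)              # 1 - 1.5/20 = 0.925

print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}, MAE: {mae:.3f}, R^2: {r2:.3f}")
```

Note that MSE penalizes the single large error (1.0) more heavily than MAE does, which is why MSE/RMSE are preferred when large errors are especially costly.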
Clustering: Discovering Groups
What is Clustering?
Clustering is an unsupervised learning task where you group similar data points together without predefined labels. The algorithm discovers natural groupings in the data based on feature similarity.
When to Use Clustering
Use clustering when you want to discover patterns or group similar items:
- Customer segmentation for targeted marketing
- Document clustering for organizing content
- Gene expression analysis for biological research
- Image compression through color grouping
- Anomaly detection by identifying outlier clusters
- Recommendation systems through user grouping
Common Clustering Algorithms in Scikit-learn
- K-Means: Fast, simple, good for spherical clusters
- Hierarchical Clustering: Produces dendrograms, good for exploratory analysis
- DBSCAN: Finds clusters of arbitrary shape, handles outliers
- Gaussian Mixture Models (GMM): Probabilistic approach, flexible
- Spectral Clustering: Good for complex cluster shapes
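The difference between K-Means and DBSCAN is easiest to see on non-spherical data. The sketch below uses scikit-learn's two-moons generator, a shape K-Means cannot separate cleanly; DBSCAN recovers the two crescents by density (eps and min_samples are chosen by eye for this particular data):

```python
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Two interleaving crescents: non-spherical clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# eps = neighborhood radius; min_samples = points needed to form a dense region
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# DBSCAN labels noise points as -1; exclude them when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("DBSCAN found", n_clusters, "clusters")
```

Unlike K-Means, DBSCAN does not need the number of clusters up front, but it is sensitive to the eps and min_samples settings.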
Clustering Example
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)
# Standardize features (important for clustering)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Create and fit the clustering model
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)
# Visualize the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=cluster_labels, cmap='viridis', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering Results')
plt.legend()
plt.show()
# Evaluate clustering quality
inertia = kmeans.inertia_ # Sum of squared distances to nearest cluster center
print(f"Inertia: {inertia:.2f}")
Key Clustering Metrics
- Inertia: Sum of squared distances from points to their cluster centers (lower is better)
- Silhouette Score: Measures how similar points are to their own cluster vs. other clusters (-1 to 1, higher is better)
- Davies-Bouldin Index: Ratio of within-cluster to between-cluster distances (lower is better)
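Silhouette and Davies-Bouldin scores are available in sklearn.metrics and need only the data and the cluster labels, which makes them useful for picking the number of clusters. A sketch on synthetic blobs (the four centers are chosen arbitrarily so the true answer is k=4):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Four well-separated synthetic blobs
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=42)

# Score several candidate cluster counts
sil_scores, dbi_scores = {}, {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    sil_scores[k] = silhouette_score(X, labels)
    dbi_scores[k] = davies_bouldin_score(X, labels)
    print(f"k={k}: silhouette={sil_scores[k]:.3f}, davies_bouldin={dbi_scores[k]:.3f}")
```

The best k is the one with the highest silhouette score and the lowest Davies-Bouldin index; on this data both point to k=4.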
Choosing the Right Approach
Decision Framework
| Question | Answer | Approach |
|---|---|---|
| Do you have labeled data? | Yes | Classification or Regression |
| Do you have unlabeled data? | Yes | Clustering |
| Is your target categorical? | Yes | Classification |
| Is your target continuous? | Yes | Regression |
| Do you want to discover groups? | Yes | Clustering |
Quick Comparison
| Aspect | Classification | Regression | Clustering |
|---|---|---|---|
| Learning Type | Supervised | Supervised | Unsupervised |
| Target Type | Categorical | Continuous | None (discovery) |
| Requires Labels | Yes | Yes | No |
| Goal | Predict category | Predict value | Find groups |
| Evaluation | Accuracy, Precision, Recall | RMSE, R² | Silhouette, Inertia |
Best Practices with Scikit-learn
1. Always Split Your Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
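For classification, passing the labels to the stratify parameter keeps class proportions the same in both splits, which matters for imbalanced data. A sketch on a made-up imbalanced label array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Stratification preserves the 9:1 class ratio in both splits
print("train class 1 fraction:", np.mean(y_train == 1))  # 0.1
print("test  class 1 fraction:", np.mean(y_test == 1))   # 0.1
```

Without stratify, a small test set could end up with almost no minority-class samples, making the evaluation unreliable.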
2. Scale Your Features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
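Fitting the scaler on the training set only (as above) avoids leaking test-set statistics into the model. A Pipeline bundles the two steps so this happens automatically, including inside cross-validation and grid search. A minimal sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() fits the scaler on training data only, then the model on the scaled data;
# score() applies the already-fitted scaler before predicting
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.2%}")
```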
3. Use Cross-Validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean score: {scores.mean():.4f} (+/- {scores.std():.4f})")
4. Tune Hyperparameters
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
Conclusion
Scikit-learn’s three fundamental approaches (Classification, Regression, and Clustering) form the foundation of most machine learning projects:
- Classification answers “Which category?” for categorical predictions
- Regression answers “What value?” for continuous predictions
- Clustering answers “What groups exist?” for discovering patterns
The key to successful machine learning is matching the right approach to your problem. Start by understanding your data and defining your objective clearly. Then choose the appropriate algorithm, prepare your data carefully, and evaluate your results rigorously.
Scikit-learn makes this process accessible with its consistent API and comprehensive documentation. Whether you’re a beginner building your first model or an experienced practitioner optimizing complex systems, scikit-learn provides the tools you need to succeed.