Scikit-learn Fundamentals: Classification, Regression, and Clustering
Scikit-learn is the go-to machine learning library for Python practitioners. Its clean API, comprehensive documentation, and rich collection of algorithms make it the foundation for countless machine learning projects. Whether you’re building your first model or optimizing a production system, scikit-learn provides the tools you need.
At its core, scikit-learn enables three fundamental machine learning approaches: Classification, Regression, and Clustering. Understanding when and how to use each approach is essential for solving real-world problems effectively.
Classification: Predicting Categories
What is Classification?
Classification is a supervised learning task where you predict which category or class an observation belongs to. The model learns from labeled examples where the correct category is known, then applies that knowledge to predict categories for new, unseen data.
When to Use Classification
Use classification when your target variable is categorical (discrete categories):
- Email spam detection (spam vs. not spam)
- Disease diagnosis (disease present vs. absent)
- Customer churn prediction (will leave vs. will stay)
- Image recognition (cat, dog, bird, etc.)
- Credit approval (approve vs. deny)
Common Classification Algorithms in Scikit-learn
- Logistic Regression: Fast, interpretable, good baseline
- Decision Trees: Interpretable, handles non-linear relationships
- Random Forests: Ensemble method, robust, handles complex patterns
- Support Vector Machines (SVM): Powerful for high-dimensional data
- Naive Bayes: Fast, works well with text data
- K-Nearest Neighbors (KNN): Simple lazy learner; training just stores the data
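All of these classifiers share scikit-learn's uniform estimator API (fit, predict, score), so swapping algorithms is a one-line change. A minimal sketch on the built-in iris data, comparing three of the algorithms above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Every classifier exposes the same fit/predict/score interface
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.2%}")
```

Because the interface is identical, you can benchmark several algorithms this way before committing to one.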
Classification Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load sample data
iris = load_iris()
X = iris.data # Features (flower measurements)
y = iris.target # Labels (flower species: 0, 1, or 2)
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Create and train the classifier
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(X_train, y_train)
# Make predictions
y_pred = classifier.predict(X_test)
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Key Classification Metrics
- Accuracy: Percentage of correct predictions (use when classes are balanced)
- Precision: Of positive predictions, how many were correct? (important when false positives are costly)
- Recall: Of actual positives, how many did we identify? (important when false negatives are costly)
- F1-Score: Balanced measure combining precision and recall
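All four metrics are available in sklearn.metrics. A small illustration on made-up binary labels (the arrays below are hypothetical, chosen so the counts are easy to check by hand):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up binary labels: 1 = positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Confusion counts: TP=3, FN=1, FP=2, TN=4
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # (3+4)/10 = 0.70
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 3/(3+2) = 0.60
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 3/(3+1) = 0.75
print(f"F1:        {f1_score(y_true, y_pred):.2f}")         # 2*0.60*0.75/1.35 = 0.67
```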
Regression: Predicting Continuous Values
What is Regression?
Regression is a supervised learning task where you predict a continuous numerical value. The model learns the relationship between input features and a continuous target variable, then uses that relationship to predict values for new data.
When to Use Regression
Use regression when your target variable is continuous (numerical values):
- House price prediction based on features
- Stock price forecasting
- Temperature prediction
- Customer lifetime value estimation
- Sales forecasting
- Demand prediction
Common Regression Algorithms in Scikit-learn
- Linear Regression: Simple, interpretable, good baseline
- Ridge/Lasso Regression: Linear regression with regularization
- Polynomial Regression: Captures non-linear relationships (PolynomialFeatures combined with a linear model)
- Support Vector Regression (SVR): Powerful for complex relationships
- Random Forest Regression: Ensemble method, handles non-linearity
- Gradient Boosting Regression: Often produces best results
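Ridge and Lasso are plain linear regression plus a penalty on coefficient size (L2 and L1 respectively); the L1 penalty can drive coefficients exactly to zero, effectively performing feature selection. A minimal sketch on the diabetes data, with alpha=1.0 chosen arbitrarily for illustration:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = load_diabetes(return_X_y=True)

# alpha is arbitrary here; tune it in practice (e.g. with RidgeCV/LassoCV)
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty zeroes some coefficients out

print("OLS   nonzero coefficients:", int(np.sum(ols.coef_ != 0)))
print("Ridge nonzero coefficients:", int(np.sum(ridge.coef_ != 0)))
print("Lasso nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
```

On this dataset Lasso keeps only a handful of features, while Ridge keeps all of them but with smaller magnitudes.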
Regression Example
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Load sample data
diabetes = load_diabetes()
X = diabetes.data # Features (patient measurements)
y = diabetes.target # Target (disease progression)
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Create and train the regressor
regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)
# Make predictions
y_pred = regressor.predict(X_test)
# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")
Key Regression Metrics
- Mean Squared Error (MSE): Average squared difference between predicted and actual values
- Root Mean Squared Error (RMSE): Square root of MSE, in same units as target
- Mean Absolute Error (MAE): Average absolute difference between predicted and actual values
- R² Score: Proportion of variance explained (1.0 is perfect, 0.0 matches always predicting the mean, and worse models can go negative)
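These metrics can be checked by hand on a tiny example. The arrays below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up predictions: errors are -0.5, 0.0, +0.5, +1.0
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

mse = mean_squared_error(y_true, y_pred)   # mean of [0.25, 0, 0.25, 1] = 0.375
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)  # mean of [0.5, 0, 0.5, 1] = 0.5
r2 = r2_score(y_true, y_pred)              # 1 - 1.5/20 = 0.925

print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}, MAE: {mae:.3f}, R^2: {r2:.3f}")
```

Note that MSE penalizes the single large error (1.0) more heavily than MAE does, which is why MSE/RMSE are preferred when large errors are especially costly.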
Clustering: Discovering Groups
What is Clustering?
Clustering is an unsupervised learning task where you group similar data points together without predefined labels. The algorithm discovers natural groupings in the data based on feature similarity.
When to Use Clustering
Use clustering when you want to discover patterns or group similar items:
- Customer segmentation for targeted marketing
- Document clustering for organizing content
- Gene expression analysis for biological research
- Image compression through color grouping
- Anomaly detection by identifying outlier clusters
- Recommendation systems through user grouping
Common Clustering Algorithms in Scikit-learn
- K-Means: Fast, simple, good for spherical clusters
- Hierarchical Clustering: Produces dendrograms, good for exploratory analysis
- DBSCAN: Finds clusters of arbitrary shape, handles outliers
- Gaussian Mixture Models (GMM): Probabilistic approach, flexible
- Spectral Clustering: Good for complex cluster shapes
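The difference between K-Means and DBSCAN is easiest to see on non-spherical data. The sketch below uses scikit-learn's two-moons generator, a shape K-Means cannot separate cleanly; DBSCAN recovers the two crescents by density (eps and min_samples are chosen by eye for this particular data):

```python
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Two interleaving crescents: non-spherical clusters
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# eps = neighborhood radius; min_samples = points needed to form a dense region
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# DBSCAN labels noise points as -1; exclude them when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("DBSCAN found", n_clusters, "clusters")
```

Unlike K-Means, DBSCAN does not need the number of clusters up front, but it is sensitive to the eps and min_samples settings.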
Clustering Example
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)
# Standardize features (important for clustering)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Create and fit the clustering model
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)
# Visualize the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=cluster_labels, cmap='viridis', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering Results')
plt.legend()
plt.show()
# Evaluate clustering quality
inertia = kmeans.inertia_ # Sum of squared distances to nearest cluster center
print(f"Inertia: {inertia:.2f}")
Key Clustering Metrics
- Inertia: Sum of squared distances from points to their cluster centers (lower is better)
- Silhouette Score: Measures how similar points are to their own cluster vs. other clusters (-1 to 1, higher is better)
- Davies-Bouldin Index: Ratio of within-cluster to between-cluster distances (lower is better)
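Silhouette and Davies-Bouldin scores are available in sklearn.metrics and need only the data and the cluster labels, which makes them useful for picking the number of clusters. A sketch on synthetic blobs (the four centers are chosen arbitrarily so the true answer is k=4):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Four well-separated synthetic blobs
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=42)

# Score several candidate cluster counts
sil_scores, dbi_scores = {}, {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    sil_scores[k] = silhouette_score(X, labels)
    dbi_scores[k] = davies_bouldin_score(X, labels)
    print(f"k={k}: silhouette={sil_scores[k]:.3f}, davies_bouldin={dbi_scores[k]:.3f}")
```

The best k is the one with the highest silhouette score and the lowest Davies-Bouldin index; on this data both point to k=4.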
Choosing the Right Approach
Decision Framework
| Question | Answer | Approach |
|---|---|---|
| Do you have labeled data? | Yes | Classification or Regression |
| Do you have unlabeled data? | Yes | Clustering |
| Is your target categorical? | Yes | Classification |
| Is your target continuous? | Yes | Regression |
| Do you want to discover groups? | Yes | Clustering |
Quick Comparison
| Aspect | Classification | Regression | Clustering |
|---|---|---|---|
| Learning Type | Supervised | Supervised | Unsupervised |
| Target Type | Categorical | Continuous | None (discovery) |
| Requires Labels | Yes | Yes | No |
| Goal | Predict category | Predict value | Find groups |
| Evaluation | Accuracy, Precision, Recall | RMSE, R² | Silhouette, Inertia |
Best Practices with Scikit-learn
1. Always Split Your Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
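For classification, passing the labels to the stratify parameter keeps class proportions the same in both splits, which matters for imbalanced data. A sketch on a made-up imbalanced label array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Stratification preserves the 9:1 class ratio in both splits
print("train class 1 fraction:", np.mean(y_train == 1))  # 0.1
print("test  class 1 fraction:", np.mean(y_test == 1))   # 0.1
```

Without stratify, a small test set could end up with almost no minority-class samples, making the evaluation unreliable.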
2. Scale Your Features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
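Fitting the scaler on the training set only (as above) avoids leaking test-set statistics into the model. A Pipeline bundles the two steps so this happens automatically, including inside cross-validation and grid search. A minimal sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() fits the scaler on training data only, then the model on the scaled data;
# score() applies the already-fitted scaler before predicting
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.2%}")
```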
3. Use Cross-Validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean score: {scores.mean():.4f} (+/- {scores.std():.4f})")
4. Tune Hyperparameters
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
Conclusion
Scikit-learn’s three fundamental approaches (Classification, Regression, and Clustering) form the foundation of most machine learning projects:
- Classification answers “Which category?” for categorical predictions
- Regression answers “What value?” for continuous predictions
- Clustering answers “What groups exist?” for discovering patterns
The key to successful machine learning is matching the right approach to your problem. Start by understanding your data and defining your objective clearly. Then choose the appropriate algorithm, prepare your data carefully, and evaluate your results rigorously.
Scikit-learn makes this process accessible with its consistent API and comprehensive documentation. Whether you’re a beginner building your first model or an experienced practitioner optimizing complex systems, scikit-learn provides the tools you need to succeed.