Building Production ML Systems: MLOps Best Practices

Published: June 20, 2025 Updated: May 8, 2026 Larry Qu 8 min read

Introduction

Machine learning in production is vastly different from notebooks and competitions. A model that achieves 95% accuracy in a Jupyter notebook can fail silently in production due to stale data, distribution shift, or infrastructure mismatches. MLOps closes this gap by treating ML systems with the same engineering rigor as any other production software.

This guide covers the full lifecycle: data pipelines, feature engineering, experiment tracking, deployment, and monitoring.


The ML Lifecycle

Before writing a single line of model code, it helps to understand where engineers actually spend their time. The breakdown below is a rough rule of thumb rather than a measurement, though it matches the pattern many production teams describe:

| Phase | Typical Time Allocation |
|---|---|
| Data collection | 30–40% |
| Data cleaning | 20–25% |
| Exploratory analysis | 10–15% |
| Feature engineering | 10–15% |
| Model training | 5–10% |
| Deployment & monitoring | 5–10% |

Most of the work is upstream of training. A well-designed data pipeline pays dividends across every model you’ll ever train. The architecture below shows how data flows from raw sources to model serving:

flowchart TD
    A[Raw Sources\nAPIs / DBs / Files] --> B[Ingestion\nKafka / S3 / Firehose]
    B --> C[Validation\nSchema + Quality Checks]
    C --> D[Transformation\nCleaning / Normalization]
    D --> E[Feature Store\nVersioned + Cached]
    E --> F[Training]
    E --> G[Online Serving]
    F --> H[Model Registry]
    H --> G

The Feature Store is the key architectural element here. It ensures the features used at training time are identical to those served at inference time — a mismatch between the two is one of the most common sources of silent production failures.


Data Engineering

Data Quality

Data quality issues are far cheaper to catch before training than after deployment. Define validation rules against your schema and run them as a gate in your pipeline:

  • Missing values — define fill strategy per column (mean, median, forward-fill, or sentinel)
  • Outlier detection — use IQR or z-score thresholds, log anomalies rather than silently dropping
  • Type validation — enforce dtypes at the schema level, not in model code
  • Temporal consistency — for time-series data, check for gaps and ordering
  • Distribution monitoring — baseline each feature’s distribution at training time; alert on significant divergence later
  • Duplicate detection — deduplication logic belongs in the pipeline, not the model
  • PII handling — mask or tokenize sensitive fields before they reach the feature store
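
Several of the rules above can be sketched as a small validation gate. This is a minimal standard-library illustration, not a substitute for a dedicated validation tool; the `SCHEMA` format, the `validate_rows` and `iqr_outliers` helpers, and the example columns are all hypothetical.

```python
from statistics import quantiles

# Hypothetical schema: column name -> (expected type, nullable?)
SCHEMA = {"user_id": (str, False), "age": (int, True), "spend": (float, False)}


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def validate_rows(rows, schema=SCHEMA, key="user_id"):
    """Collect schema violations and duplicate keys; the caller decides whether to fail."""
    errors, seen = [], set()
    for i, row in enumerate(rows):
        for col, (expected, nullable) in schema.items():
            value = row.get(col)
            if value is None:
                if not nullable:
                    errors.append(f"row {i}: {col} is missing")
            elif not isinstance(value, expected):
                errors.append(f"row {i}: {col} should be {expected.__name__}")
        if row.get(key) in seen:
            errors.append(f"row {i}: duplicate {key}={row.get(key)!r}")
        seen.add(row.get(key))
    return errors


rows = [
    {"user_id": "a", "age": 31, "spend": 12.5},
    {"user_id": "b", "age": None, "spend": 9.0},
    {"user_id": "b", "age": 40, "spend": "n/a"},  # duplicate key and wrong type
]
problems = validate_rows(rows)
```

A pipeline gate would typically raise when `problems` is non-empty, while the output of `iqr_outliers` is logged for review rather than used to drop rows.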

Feature Store Pattern

A naive implementation computes features on the fly per request, which leads to duplicated logic, inconsistent preprocessing between training and serving, and high latency. A feature store centralises this:

import json

class FeatureStore:
    def __init__(self, cache, ttl_seconds: int = 3600):
        self.cache = cache  # e.g. a Redis client exposing get/set
        self.ttl_seconds = ttl_seconds

    def get(self, entity_id: str, features: list[str]) -> dict:
        """Return features from cache, computing any that are missing."""
        result = {}
        for name in features:
            key = f"{entity_id}:{name}"
            cached = self.cache.get(key)
            if cached is None:
                value = self._compute(entity_id, name)
                # Serialise so cache hits and misses return the same type
                # (Redis hands back bytes/str, not the original Python object)
                self.cache.set(key, json.dumps(value), ex=self.ttl_seconds)
            else:
                value = json.loads(cached)
            result[name] = value
        return result

    def _compute(self, entity_id: str, feature_name: str):
        raise NotImplementedError("Subclass with domain logic")

In practice you would use a managed feature store (Feast, Tecton, or SageMaker Feature Store) rather than rolling your own, but the pattern above illustrates the core contract: a single source of truth for features consumed by both training jobs and the serving layer.


Model Training

Experiment Tracking

Without experiment tracking, hyperparameter tuning becomes archaeology — you end up re-running experiments you’ve already done. MLflow makes every run reproducible and comparable:

import mlflow

mlflow.set_experiment("customer_churn_v2")

with mlflow.start_run():
    mlflow.log_params({"learning_rate": 0.01, "max_depth": 5, "n_estimators": 100})

    model = train(X_train, y_train)
    metrics = evaluate(model, X_val, y_val)

    mlflow.log_metrics(metrics)
    mlflow.sklearn.log_model(model, "model")

Track everything: hyperparameters, dataset version, git commit hash, and environment dependencies. A run you can’t reproduce is a run you can’t trust.
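
As a rough sketch of what "track everything" can look like, the helpers below gather a git commit hash, a dataset fingerprint, and the Python version into a dictionary of string tags. The function names and the choice of SHA-256 over the raw file are assumptions made for illustration; a data versioning tool would normally supply the dataset version.

```python
import hashlib
import platform
import subprocess


def git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def dataset_fingerprint(path: str) -> str:
    """SHA-256 of the file contents, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_metadata(dataset_path: str) -> dict:
    """Collect string tags describing the code, data and runtime of a run."""
    return {
        "git_commit": git_commit(),
        "dataset_sha256": dataset_fingerprint(dataset_path),
        "python_version": platform.python_version(),
    }
```

Inside the `mlflow.start_run()` block, the result could be attached with `mlflow.set_tags(run_metadata("train.csv"))`, where `train.csv` stands in for your actual training file.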

Grid search scales poorly with the number of parameters. Optuna’s default sampler (TPE, a form of Bayesian optimization) samples the search space adaptively, spending more trials near promising regions:

import optuna

def objective(trial):
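    params = {
        # Use the same name for the Optuna parameter and the train() keyword,
        # so study.best_params can be passed straight back to train()
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True),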
        "max_depth": trial.suggest_int("max_depth", 3, 15),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
    }
    model = train(X_train, y_train, **params)
    return evaluate_f1(model, X_val, y_val)

study = optuna.create_study(direction="maximize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=50)
print(study.best_params)

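MedianPruner stops unpromising trials early, but only when the objective reports intermediate values with trial.report() and checks trial.should_prune(). The objective above returns a single final score, so nothing is pruned as written. For deep learning, integrate with your framework’s callback system so Optuna can prune between epochs.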

Validation Strategy

The right split strategy depends on your data. A random train/test split on time-series data leaks future information into training — always use a time-based split for temporal problems:

def time_split(df, test_ratio=0.2):
    """Preserve temporal order; no shuffling."""
    n = int(len(df) * (1 - test_ratio))
    return df.iloc[:n], df.iloc[n:]

# For classification with class imbalance, use stratification
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Report precision, recall, and F1 alongside accuracy — accuracy alone is misleading on imbalanced datasets where a majority-class classifier scores high trivially.
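
A small worked example makes the point concrete. The `binary_metrics` helper below is a hypothetical, standard-library sketch; the class counts (95 negatives, 5 positives) are invented for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 95 negatives, 5 positives; the "model" always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
scores = binary_metrics(y_true, y_pred)
# accuracy is 0.95, while precision, recall and F1 are all 0.0
```

In a real project, `sklearn.metrics.classification_report` gives the same per-class breakdown without hand-rolled code.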


Model Deployment

Serving Architecture

The choice between batch and online serving is a product decision masquerading as a technical one. Batch inference is simpler and cheaper; online serving is necessary when predictions must be fresh at request time.

flowchart LR
    subgraph Batch
        B1[Scheduler] --> B2[Inference Job]
        B2 --> B3[Results DB]
        B3 --> B4[Application]
    end

    subgraph Online
        O1[Application] --> O2[Model API]
        O2 --> O3[Feature Store]
        O3 --> O2
        O2 --> O1
    end

For online serving, containerise the model to decouple it from the deployment environment:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl scaler.pkl app.py ./
EXPOSE 8080
CMD ["gunicorn", "-w", "2", "-b", "0.0.0.0:8080", "app:app"]

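Note gunicorn rather than Flask’s dev server: never use a development server in production. Gunicorn’s documentation suggests (2 * CPU cores) + 1 workers as a starting point. CPU-bound model inference often does better with roughly one worker per core, so load-test before settling on a number.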
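
If you would rather compute the worker count than hard-code it, a gunicorn config file is one option. The sketch below assumes a file named `gunicorn.conf.py` in the working directory; `GUNICORN_WORKERS` is a hypothetical environment variable introduced here as an override, and the timeout value is an arbitrary example.

```python
# gunicorn.conf.py: gunicorn settings expressed as plain Python variables
import multiprocessing
import os

cores = multiprocessing.cpu_count()

# Starting point only; GUNICORN_WORKERS (hypothetical) overrides it per deployment.
workers = int(os.environ.get("GUNICORN_WORKERS", 2 * cores + 1))
bind = "0.0.0.0:8080"
timeout = 30  # seconds before an unresponsive worker is restarted
```

Command-line flags take precedence over the config file, so the `-w 2` in the Dockerfile's `CMD` would need to be dropped for this to take effect. Inside a container, `cpu_count()` may report the host's cores rather than the container's CPU limit, which is one more reason to keep an explicit override.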

Prediction API

A well-designed serving endpoint separates inference logic from HTTP concerns and always returns structured errors:

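from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)
# Load artifacts once at startup, not per request
model = joblib.load("model.pkl")
scaler = joblib.load("scaler.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    # silent=True yields None on malformed JSON instead of Flask's default HTML 400 page
    body = request.get_json(force=True, silent=True)
    if not isinstance(body, dict) or "features" not in body:
        return jsonify({"error": "missing 'features' key"}), 400

    try:
        X = np.asarray(body["features"], dtype=float).reshape(1, -1)
        proba = model.predict_proba(scaler.transform(X))[0]
    except (TypeError, ValueError) as exc:  # non-numeric values or wrong feature count
        return jsonify({"error": str(exc)}), 400
    # argmax is an index into model.classes_, which matches the label for 0..n-1 classes
    return jsonify({"label": int(proba.argmax()), "confidence": float(proba.max())})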

@app.route("/health")
def health():
    return {"status": "ok"}

Add a /health endpoint from day one — load balancers and orchestrators (Kubernetes, ECS) depend on it to determine whether to route traffic to an instance.
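
One way to keep inference logic separate from HTTP concerns is to move request validation into a plain function that the route handler calls. The sketch below is framework-free; `parse_features` and its `n_features` argument are hypothetical names, and the expected feature count would come from your own model.

```python
def parse_features(body, n_features):
    """Validate a decoded JSON body and return the feature vector as floats.

    Raises ValueError with a client-safe message on any problem.
    """
    if not isinstance(body, dict) or "features" not in body:
        raise ValueError("missing 'features' key")
    features = body["features"]
    if not isinstance(features, list) or len(features) != n_features:
        raise ValueError(f"'features' must be a list of {n_features} numbers")
    for value in features:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("'features' must contain only numbers")
    return [float(v) for v in features]
```

The route handler then shrinks to a `try`/`except ValueError` around this call that maps the message to a 400 response, and the validation can be unit-tested without starting a web server.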


Monitoring

Once a model is in production, you need to know when it stops working. There are two distinct failure modes: data drift (input distribution shifts) and concept drift (the relationship between inputs and targets changes). Both are silent without instrumentation.

flowchart TD
    P[Prediction Request] --> LOG[Log Features + Output]
    LOG --> M1[Input Distribution Monitor]
    LOG --> M2[Output Distribution Monitor]
    GT[Ground Truth\nfeedback loop] --> M3[Accuracy Monitor]
    M1 & M2 & M3 --> ALERT{Threshold\nbreached?}
    ALERT -- yes --> RETRAIN[Trigger Retraining DAG]
    ALERT -- no --> OK[Continue]

Expose Prometheus metrics from your serving layer so you can build dashboards and alerting without custom infrastructure:

from prometheus_client import Counter, Histogram

PREDICTIONS = Counter("ml_predictions_total", "Total predictions", ["label"])
LATENCY     = Histogram("ml_prediction_latency_seconds", "Inference latency")

@app.route("/predict", methods=["POST"])
def predict():
    with LATENCY.time():
        # ... inference logic ...
        PREDICTIONS.labels(label=str(result["label"])).inc()
        return jsonify(result)

Track at minimum: prediction latency (p50/p95/p99), prediction distribution per class, and feature means/stddevs against training baselines. Alert on any metric that diverges beyond two standard deviations from baseline.
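
The two-standard-deviation rule can be sketched for a single numeric feature as follows. This is a deliberately simple check with hypothetical helper names and invented sample values; it compares the mean of a live window against the training baseline and says nothing about changes in the shape of the distribution.

```python
from statistics import mean, stdev


def fit_baseline(training_values):
    """Summarise one feature's training distribution."""
    return {"mean": mean(training_values), "std": stdev(training_values)}


def drifted(baseline, live_values, n_sigma=2.0):
    """True when the live mean is more than n_sigma baseline stds from the baseline mean."""
    if baseline["std"] == 0:
        # A constant feature at training time: any change at all counts as drift
        return mean(live_values) != baseline["mean"]
    z = abs(mean(live_values) - baseline["mean"]) / baseline["std"]
    return z > n_sigma


baseline = fit_baseline([10, 12, 11, 13, 9, 10, 12, 11])
stable = drifted(baseline, [11, 10, 12, 11])    # live mean equals the baseline mean
shifted = drifted(baseline, [18, 19, 17, 20])   # live mean far above the baseline
```

Here `stable` is `False` and `shifted` is `True`. The baseline dictionary would be stored alongside the model at training time so the serving layer compares against the same numbers.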


Retraining Pipeline

Manual retraining is a liability. Automate it as a scheduled DAG so the model stays fresh without human intervention. The pipeline should gate on validation before promoting to production — never deploy a retrained model that hasn’t beaten the current one:

flowchart LR
    S[Schedule / Drift Alert] --> D[Fetch New Data]
    D --> T[Train Candidate Model]
    T --> V{Beats baseline?}
    V -- yes --> R[Register in Model Registry]
    R --> DEPLOY[Deploy to Production]
    V -- no --> FAIL[Alert + Keep Incumbent]
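
# Airflow DAG skeleton: fill in task bodies with your actual logic.
# fetch_new_data, train_candidate, compare_to_baseline and promote_if_better
# are placeholders that you must define or import.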
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

with DAG("ml_retraining", schedule="0 2 * * 0",
         default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
         start_date=datetime(2025, 1, 1), catchup=False) as dag:

    fetch   = PythonOperator(task_id="fetch_data",     python_callable=fetch_new_data)
    train   = PythonOperator(task_id="train",          python_callable=train_candidate)
    compare = PythonOperator(task_id="compare",        python_callable=compare_to_baseline)
    deploy  = PythonOperator(task_id="deploy",         python_callable=promote_if_better)

    fetch >> train >> compare >> deploy
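
The "beats baseline?" gate is the step most worth spelling out. The function below is one possible shape for the comparison, not the logic behind the DAG's placeholder callables; `should_promote`, its parameters, and the metric names are hypothetical.

```python
def should_promote(candidate, incumbent, primary="f1", min_gain=0.0, guardrails=()):
    """Decide whether a retrained model may replace the one in production.

    candidate / incumbent: dicts of metric name -> value, measured on the same holdout set.
    min_gain: the candidate must exceed the incumbent's primary metric by more than this.
    guardrails: metrics that must not get worse.
    """
    if candidate[primary] <= incumbent[primary] + min_gain:
        return False  # a tie or a marginal gain keeps the incumbent
    return all(candidate[m] >= incumbent[m] for m in guardrails)


incumbent = {"f1": 0.80, "recall": 0.78}
better = should_promote({"f1": 0.82, "recall": 0.80}, incumbent,
                        min_gain=0.01, guardrails=("recall",))
tie = should_promote({"f1": 0.80, "recall": 0.80}, incumbent)
regressed = should_promote({"f1": 0.85, "recall": 0.70}, incumbent,
                           guardrails=("recall",))
```

Only `better` is `True`. In Airflow, a check like this could sit behind a `ShortCircuitOperator` so that a `False` result skips the deploy task instead of relying on the deploy callable to re-check.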

Best Practices

  • Version everything — code (git), data (DVC or S3 versioning), and models (MLflow registry). If you can’t reproduce a run, you can’t debug it.
  • Fail loudly on data issues — raise exceptions on schema violations rather than silently imputing. Silent failures produce confidently wrong predictions.
  • Share preprocessing between training and serving: duplicated preprocessing logic that drifts apart between the two code paths is a common source of training–serving skew.
  • Shadow deploy new models — route a copy of live traffic to the candidate model and compare outputs before cutting over fully.
  • Define a rollback plan — keep the previous model version registered so a single command can revert a bad deploy.
  • Monitor the feature store — upstream data pipeline failures often manifest as subtle feature drift rather than obvious errors.
  • Document model cards — record intended use, training data, known limitations, and fairness considerations alongside each model version.
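
The shadow-deployment practice above can be sketched in a few lines. This assumes both models are exposed as plain callables that map a feature vector to a label; `predict_with_shadow` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("shadow")


def predict_with_shadow(features, incumbent, candidate, log=logger):
    """Serve the incumbent's prediction and run the candidate on the side.

    The candidate's output is only logged, and its failures never reach the caller.
    """
    served = incumbent(features)
    try:
        shadow = candidate(features)
        if shadow != served:
            log.info("shadow disagreement: served=%r shadow=%r", served, shadow)
    except Exception:  # a broken candidate must not affect live traffic
        log.exception("shadow model failed")
    return served
```

In a real service the shadow call would usually run asynchronously or on mirrored traffic so that it adds no latency to the live request; the synchronous version here only illustrates the contract.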

Common Pitfalls

  • Leaking future data into training — particularly common with time-series joins, lag features, and target encoding. Always apply temporal splits before any feature computation.
  • Not matching preprocessing at serving time — fitting a StandardScaler on training data and forgetting to save and reload it at serving time. Save all preprocessing artifacts alongside the model.
  • Optimising for accuracy on imbalanced data — a model that predicts the majority class exclusively achieves high accuracy but zero utility. Use F1, AUC-ROC, or precision/recall as your primary metrics.
  • Treating monitoring as optional — models degrade silently. Without monitoring, you discover failures through customer complaints or downstream metric drops, not alerts.
  • Deploying retrained models without validation: automated retraining pipelines that deploy unconditionally can ship a degraded model if training data quality drops.
  • Row-by-row feature computation: computing features one row at a time in a Python loop at serving time is orders of magnitude slower than vectorised batch computation. Profile your serving path early.
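
The first two pitfalls share a remedy: fit preprocessing statistics on the training split only, then persist them and reload the same values at serving time. The sketch below uses the standard library and JSON purely for illustration; the helper names are hypothetical, and with scikit-learn you would save the fitted scaler with joblib instead.

```python
import json
from statistics import mean, pstdev


def fit_scaler(train_column):
    """Learn scaling statistics from the training split only."""
    std = pstdev(train_column)
    return {"mean": mean(train_column), "std": std if std > 0 else 1.0}


def transform(column, params):
    """Apply previously fitted statistics; never refit on validation or live data."""
    return [(x - params["mean"]) / params["std"] for x in column]


def save_params(params, path):
    """Persist the fitted statistics next to the model artifact."""
    with open(path, "w") as fh:
        json.dump(params, fh)


def load_params(path):
    """Reload the statistics at serving time."""
    with open(path) as fh:
        return json.load(fh)


history = [1.0, 2.0, 3.0, 4.0, 10.0]      # ordered oldest to newest
train, test = history[:4], history[4:]    # split first, then compute statistics
params = fit_scaler(train)
test_scaled = transform(test, params)     # uses training statistics only
```

Fitting on `history` instead of `train` would let the final value of 10.0 influence the mean and standard deviation, which is exactly the kind of leak the temporal split is meant to prevent.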
