
MLflow vs Kubeflow vs Weights & Biases: MLOps Platform Comparison

Introduction

As machine learning projects scale from experiments to production, teams need robust MLOps platforms to manage the entire lifecycle. Three leading tools have emerged: MLflow, Kubeflow, and Weights & Biases (W&B). Each serves different needs and comes with distinct trade-offs.

This guide compares these platforms across experiment tracking, model registry, pipeline orchestration, and deployment capabilities to help you choose the right tool for your workflow.

What is MLflow?

MLflow is an open-source MLOps platform developed by Databricks that focuses on ML lifecycle management. It provides four core components:

  • Experiment Tracking: Log parameters, metrics, and artifacts
  • Model Registry: Centralized model versioning and staging
  • Model Serving: Built-in serving infrastructure
  • MLflow Projects: Packaging ML code in a reproducible format

MLflow Best Practices

import mlflow
import mlflow.xgboost
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Set up MLflow tracking
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("credit-scoring-model")

# Start an experiment run
with mlflow.start_run(run_name="xgboost-v2-optimized"):
    # Log parameters
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("n_estimators", 500)
    
    # Train your model
    model = train_xgboost_model(X_train, y_train)
    
    # Log metrics (compute predictions once and reuse them)
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))
    mlflow.log_metric("auc_roc", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    
    # Log model
    mlflow.xgboost.log_model(model, "model")
    
    # Log artifacts
    mlflow.log_artifact("feature_importance.png")
    mlflow.log_artifact("confusion_matrix.png")

Using MLflow Model Registry

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model from a run
model_uri = "runs:/<run_id>/model"
model_name = "credit-scoring-production"
model_version = mlflow.register_model(model_uri, model_name)

# Transition model through stages
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Staging"
)

# Add a model description
client.update_model_version(
    name=model_name,
    version=model_version.version,
    description="XGBoost model trained on 2025-12 dataset"
)
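Once a version is in "Staging", consumers can load it by stage rather than by run ID: MLflow resolves `models:/<name>/<stage>` URIs against the registry. A minimal helper to build such URIs (the `registry_uri` function name is my own, not part of MLflow):

```python
def registry_uri(name: str, stage_or_version: str) -> str:
    """Build an MLflow Model Registry URI, e.g. 'models:/my-model/Staging'.

    Pass a stage name ("Staging", "Production") or a numeric version
    string ("3"). Consumers never need to hard-code a run ID.
    """
    return f"models:/{name}/{stage_or_version}"

# A consumer would then load it with mlflow.pyfunc.load_model(...)
uri = registry_uri("credit-scoring-production", "Staging")
```

Loading by stage means redeploying a new model is just a stage transition, with no code change on the serving side.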

What is Kubeflow?

Kubeflow is an open-source toolkit for running ML workloads on Kubernetes. It provides a comprehensive stack for the entire ML lifecycle:

  • Kubeflow Pipelines: Orchestrate complex ML workflows
  • Katib: Hyperparameter tuning and neural architecture search
  • KFServing: Model serving with serverless inference
  • Training Operators: Distributed training on Kubernetes

Kubeflow Pipeline Example

# pipeline.yaml
apiVersion: kubeflow.org/v1alpha2
kind: Pipeline
metadata:
  name: ml-training-pipeline
spec:
  pipelineSpec:
    tasks:
      - name: preprocess-data
        container:
          image: my-preprocess:latest
          command: ["python", "preprocess.py"]
          args: ["--input", "gs://bucket/raw_data.csv",
                 "--output", "{{$.outputs.artifacts.preprocessed.path}}"]
        outputs:
          artifacts:
            - name: preprocessed
              path: /data/preprocessed.csv

      - name: train-model
        container:
          image: my-trainer:latest
          command: ["python", "train.py"]
          args: ["--data", "{{$.inputs.artifacts.preprocessed.path}}",
                 "--model", "{{$.outputs.artifacts.model.path}}"]
        inputs:
          artifacts:
            - name: preprocessed
              task: preprocess-data
              path: /data/input.csv
        outputs:
          artifacts:
            - name: model
              path: /model/model.pkl

      - name: evaluate-model
        container:
          image: my-evaluator:latest
          command: ["python", "evaluate.py"]
          args: ["--model", "{{$.inputs.artifacts.model.path}}",
                 "--metrics", "{{$.outputs.artifacts.metrics.path}}"]
        inputs:
          artifacts:
            - name: model
              task: train-model
              path: /model/input.pkl
        outputs:
          artifacts:
            - name: metrics
              path: /metrics/metrics.json

Kubeflow Python SDK

from kfp import dsl
from kfp.components import InputPath, OutputPath, create_component_from_func

@create_component_from_func
def preprocess_op(input_path: str, output_path: OutputPath(str)):
    import pandas as pd
    df = pd.read_csv(input_path)
    df_clean = df.fillna(0)  # impute missing values instead of dropping rows
    df_clean.to_csv(output_path, index=False)

@create_component_from_func
def train_op(data_path: InputPath(str), model_path: OutputPath(str)):
    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    df = pd.read_csv(data_path)
    X = df.drop('target', axis=1)
    y = df['target']
    
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)
    
    joblib.dump(model, model_path)

@create_component_from_func
def evaluate_op(model_path: InputPath(str), data_path: InputPath(str)) -> str:
    import json
    import joblib
    import pandas as pd
    from sklearn.metrics import accuracy_score
    
    model = joblib.load(model_path)
    df = pd.read_csv(data_path)
    X = df.drop('target', axis=1)
    y = df['target']
    
    accuracy = accuracy_score(y, model.predict(X))
    return json.dumps({"accuracy": accuracy})

@dsl.pipeline(name="training-pipeline")
def pipeline(data_path: str):
    # OutputPath/InputPath parameters become pipeline artifacts;
    # KFP strips the "_path" suffix from the parameter names.
    preprocess = preprocess_op(input_path=data_path)
    train = train_op(data=preprocess.outputs["output"])
    evaluate = evaluate_op(model=train.outputs["model"],
                           data=preprocess.outputs["output"])

What is Weights & Biases?

Weights & Biases (W&B) is a cloud-first experiment tracking platform designed for speed and simplicity. It focuses on:

  • Experiment Tracking: Beautiful, real-time visualizations
  • Sweeps: Automated hyperparameter optimization
  • Reports: Collaboration and documentation
  • Artifacts: Model and dataset versioning

W&B Best Practices

import wandb

# Initialize W&B
wandb.init(
    project="credit-scoring",
    entity="my-team",
    name="xgboost-exp-001",
    tags=["production", "xgboost", "v2"],
    notes="Testing new feature engineering approach"
)

# Watch gradients and parameters (call once, before the training loop)
wandb.watch(model, log="all", log_freq=100)

# Log training metrics
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader)
    val_loss, val_acc = evaluate(model, val_loader)
    
    # Log metrics to W&B
    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_loss": val_loss,
        "val_accuracy": val_acc,
        "learning_rate": scheduler.get_last_lr()[0]
    })

# Save model checkpoint
torch.save(model.state_dict(), "model.pt")
wandb.save("model.pt")

W&B Sweeps for Hyperparameter Tuning

# sweep.yaml
program: train.py
method: bayes
metric:
  name: val_accuracy
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.0001
    max: 0.1
  batch_size:
    values: [16, 32, 64, 128]
  max_depth:
    distribution: int_uniform
    min: 3
    max: 10
  n_estimators:
    values: [100, 200, 500]
  min_child_weight:
    distribution: log_uniform_values
    min: 1
    max: 10
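The log-uniform distributions above sample uniformly in log space, so every order of magnitude of learning rate is equally likely. A pure-Python sketch of the sampling idea (an illustration, not W&B's implementation):

```python
import math
import random

def sample_log_uniform(lo: float, hi: float, rng: random.Random) -> float:
    """Draw x so that log(x) is uniform on [log(lo), log(hi)]:
    each decade between lo and hi is equally likely."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

rng = random.Random(42)
samples = [sample_log_uniform(1e-4, 0.1, rng) for _ in range(1000)]
# All samples respect the bounds; roughly a third land in each decade
assert all(1e-4 <= s <= 0.1 for s in samples)
```

A plain uniform distribution would instead spend almost all trials above 0.01, barely exploring the small learning rates that often matter most.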

# train.py with sweep
import wandb
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

wandb.init()
config = wandb.config

model = XGBClassifier(
    learning_rate=config.learning_rate,
    max_depth=config.max_depth,
    n_estimators=config.n_estimators,
    min_child_weight=config.min_child_weight
)

model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
wandb.log({"val_accuracy": accuracy})

W&B Artifacts

import wandb

# Create a dataset artifact
with wandb.init(entity="my-team", project="ml-pipeline", job_type="process-data"):
    processed_data = wandb.Artifact(
        name="processed-dataset",
        type="dataset",
        description="Cleaned and preprocessed customer data"
    )
    processed_data.add_file("data/processed.csv")
    wandb.log_artifact(processed_data)

# Use the dataset artifact in training
with wandb.init(entity="my-team", project="ml-pipeline", job_type="train"):
    artifact = wandb.use_artifact("processed-dataset:v0")
    data_dir = artifact.download()
    
    # Train model on the downloaded copy
    model = train_model(f"{data_dir}/processed.csv")
    
    # Log model as artifact
    model_artifact = wandb.Artifact(name="trained-model", type="model")
    model_artifact.add_file("model.pkl")
    wandb.log_artifact(model_artifact)
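Artifact versions are content-addressed: W&B checksums each file, so logging an unchanged file does not create a new version. A rough illustration of the idea using a plain digest (a hypothetical helper, not W&B's internals):

```python
import hashlib

def file_version_key(data: bytes) -> str:
    """Digest used to decide whether a file changed between artifact logs."""
    return hashlib.md5(data).hexdigest()

first_log  = file_version_key(b"id,score\n1,0.9\n")
second_log = file_version_key(b"id,score\n1,0.9\n")   # unchanged upload
edited_log = file_version_key(b"id,score\n1,0.95\n")  # content changed

assert first_log == second_log   # no new artifact version needed
assert first_log != edited_log   # a change produces a new version
```

This is why re-running a pipeline on identical data is cheap: unchanged files are deduplicated rather than re-stored.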

Feature Comparison

| Feature                | MLflow                            | Kubeflow             | W&B                             |
|------------------------|-----------------------------------|----------------------|---------------------------------|
| Deployment             | Self-hosted, cloud                | Kubernetes-native    | Cloud-first, self-hosted option |
| Experiment Tracking    | Good                              | Basic                | Excellent                       |
| Pipeline Orchestration | Limited                           | Excellent            | Limited                         |
| Model Serving          | Built-in                          | KFServing            | Via integration                 |
| Hyperparameter Tuning  | Basic                             | Katib                | Sweeps                          |
| Learning Curve         | Low                               | High                 | Low                             |
| Cost                   | Free (open-source)                | Infrastructure costs | Free tier + paid plans          |
| Integration            | Broad (scikit-learn, PyTorch, TF) | Kubernetes ecosystem | Most ML frameworks              |

When to Use Each Platform

Use MLflow When:

  • You need a lightweight, flexible experiment tracker
  • You’re already using Databricks or have existing ML infrastructure
  • You need model registry with staging lifecycle
  • You want to package models for portable deployment
# Good: MLflow for quick experiment tracking
import mlflow

mlflow.set_experiment("quick-experiments")
with mlflow.start_run():
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_metric("accuracy", 0.87)
    mlflow.sklearn.log_model(model, "model")

Use Kubeflow When:

  • You’re running ML on Kubernetes at scale
  • You need complex pipeline orchestration
  • Your team has Kubernetes expertise
  • You need enterprise-grade multi-tenancy
# Good: Kubeflow for complex, scalable pipelines
- name: distributed-training
  template:
    dag:
      tasks:
        - name: preprocess
          template: preprocess-container
        - name: train-distributed
          template: pytorch-job
          dependencies: [preprocess]
        - name: evaluate
          template: evaluate-container
          dependencies: [train-distributed]

Use Weights & Biases When:

  • You prioritize visualization and collaboration
  • You need rapid experiment iteration
  • Your team values beautiful dashboards
  • You want minimal infrastructure overhead
# Good: W&B for team collaboration
wandb.init(
    project="team-experiments",
    entity="research-group",
    notes="Reproducing paper results"
)

# Team can see real-time results in W&B dashboard
wandb.log({"accuracy": 0.92, "loss": 0.08})

Bad Practices to Avoid

Bad Practice 1: Using Multiple Tracking Servers Without Organization

# Bad: Scattered experiments across different servers
mlflow.set_tracking_uri("http://localhost:5000")  # Local
# Later in code...
mlflow.set_tracking_uri("http://cloud-mlflow:5000")  # Cloud
# Results: Impossible to compare experiments

Bad Practice 2: Not Using Model Registry

# Bad: Manual model versioning
import joblib
joblib.dump(model, "model_v1.pkl")  # Where is v1? What changed?
# Later...
joblib.dump(model, "model_v2.pkl")  # Confusion about lineage

Bad Practice 3: Ignoring Artifact Storage

# Bad: Not logging artifacts, losing important outputs
mlflow.log_metric("accuracy", 0.95)
# Missing: model, feature importance plots, datasets
# Result: Can't reproduce or audit experiments

Good Practices Summary

MLflow Best Practices

  1. Use a consistent experiment naming convention
  2. Log all parameters, metrics, and artifacts
  3. Use Model Registry for production models
  4. Set up proper artifact storage (S3, GCS, Azure Blob)
# Good: Comprehensive experiment logging
import mlflow

mlflow.set_experiment("production-models")

with mlflow.start_run(run_name="production-v2.1"):
    # Log everything
    mlflow.log_params(config.to_dict())
    mlflow.log_metrics(metrics)
    mlflow.log_dict(config.to_dict(), "config.yaml")
    mlflow.log_artifact("feature_importance.png")
    
    # Log the model, then register it
    mlflow.sklearn.log_model(model, "model")
    mlflow.register_model(
        model_uri=mlflow.get_artifact_uri("model"),
        name="production-model"
    )

Kubeflow Best Practices

  1. Containerize your training code
  2. Consider a managed offering such as Vertex AI Pipelines instead of self-hosting
  3. Implement proper resource limits
  4. Use KFP SDK for pipeline definition
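For point 3, resource requests and limits belong on each task's container spec so one runaway training job cannot starve the cluster. A sketch of what that looks like in pipeline YAML (the values are illustrative):

```yaml
- name: train-model
  container:
    image: my-trainer:latest
    command: ["python", "train.py"]
    resources:
      requests:     # scheduler reserves this much for the pod
        cpu: "2"
        memory: 4Gi
      limits:       # pod is throttled/evicted beyond this
        cpu: "4"
        memory: 8Gi
```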

W&B Best Practices

  1. Use named runs with clear conventions
  2. Leverage sweeps for hyperparameter search
  3. Create W&B Reports for documentation
  4. Use Artifacts for datasets and models
# Good: W&B with proper organization
wandb.init(
    project="production-ml",
    name=f"experiment-{config.experiment_id}",
    tags=["production", config.model_type],
    config=config.to_dict()  # Log config as part of init
)

# Use wandb.config so sweep agents can override values
model = Model(learning_rate=wandb.config.learning_rate)

# Watch gradients automatically; log metrics explicitly
wandb.watch(model, log_freq=100)
wandb.log(metrics)
