Introduction
As machine learning projects scale from experiments to production, teams need robust MLOps platforms to manage the entire lifecycle. Three leading tools have emerged: MLflow, Kubeflow, and Weights & Biases (W&B). Each serves different needs and comes with distinct trade-offs.
This guide compares these platforms across experiment tracking, model registry, pipeline orchestration, and deployment capabilities to help you choose the right tool for your workflow.
What is MLflow?
MLflow is an open-source MLOps platform developed by Databricks that focuses on ML lifecycle management. It provides four core components:
- MLflow Tracking: Log parameters, metrics, and artifacts
- Model Registry: Centralized model versioning and staging
- MLflow Models: A standard packaging format with built-in serving
- MLflow Projects: Packaging ML code in a reproducible format
MLflow Best Practices

```python
import mlflow
import mlflow.xgboost
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Set up MLflow tracking
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("credit-scoring-model")

# Start an experiment run
with mlflow.start_run(run_name="xgboost-v2-optimized"):
    # Log parameters
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("n_estimators", 500)

    # Train your model
    model = train_xgboost_model(X_train, y_train)

    # Log metrics (AUC-ROC needs the positive-class probability, not labels)
    preds = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, preds))
    mlflow.log_metric("f1_score", f1_score(y_test, preds))
    mlflow.log_metric("auc_roc", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

    # Log model
    mlflow.xgboost.log_model(model, "model")

    # Log artifacts
    mlflow.log_artifact("feature_importance.png")
    mlflow.log_artifact("confusion_matrix.png")
```
Using MLflow Model Registry

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model from a run
model_uri = "runs:/<run_id>/model"
model_name = "credit-scoring-production"
model_version = mlflow.register_model(model_uri, model_name)

# Transition the model through stages
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Staging"
)

# Add a model description tag
client.set_model_version_tag(
    name=model_name,
    version=model_version.version,
    key="description",
    value="XGBoost model trained on 2025-12 dataset"
)
```
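Once a version is in a stage, downstream code can resolve it through a `models:/` URI instead of hard-coding run IDs. A minimal sketch (the `registry_uri` helper is ours, not part of MLflow):

```python
def registry_uri(name: str, stage_or_version: str) -> str:
    # MLflow resolves "models:/<name>/<stage-or-version>" to the matching
    # registered model version, e.g. "models:/credit-scoring-production/Staging".
    return f"models:/{name}/{stage_or_version}"

staging_uri = registry_uri("credit-scoring-production", "Staging")
# With a tracking server available, the model can then be loaded with:
# model = mlflow.pyfunc.load_model(staging_uri)
```

Loading by stage means that promoting a new version in the registry automatically changes what consumers pick up, with no code changes downstream.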
What is Kubeflow?
Kubeflow is an open-source toolkit for running ML workloads on Kubernetes. It provides a comprehensive stack for the entire ML lifecycle:
- Kubeflow Pipelines: Orchestrate complex ML workflows
- Katib: Hyperparameter tuning and neural architecture search
- KServe (formerly KFServing): Model serving with serverless inference
- Training Operators: Distributed training on Kubernetes
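Katib experiments are defined declaratively against its Experiment CRD. As a rough sketch, here is the objective and search-space portion expressed as a Python dict (field names follow the Katib API; the parameter ranges are illustrative):

```python
# Sketch of the core of a Katib Experiment spec: what to optimize,
# which algorithm to use, and the hyperparameter search space.
katib_spec = {
    "objective": {"type": "maximize", "objectiveMetricName": "val_accuracy"},
    "algorithm": {"algorithmName": "bayesianoptimization"},
    "parameters": [
        {"name": "learning_rate", "parameterType": "double",
         "feasibleSpace": {"min": "0.0001", "max": "0.1"}},
        {"name": "max_depth", "parameterType": "int",
         "feasibleSpace": {"min": "3", "max": "10"}},
    ],
}
```

In practice this spec is embedded in an `Experiment` YAML manifest (alongside a trial template pointing at your training container) and applied to the cluster with `kubectl`.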
Kubeflow Pipeline Example

```yaml
# pipeline.yaml
apiVersion: kubeflow.org/v1alpha2
kind: Pipeline
metadata:
  name: ml-training-pipeline
spec:
  pipelineSpec:
    tasks:
      - name: preprocess-data
        container:
          image: my-preprocess:latest
          command: ["python", "preprocess.py"]
          args: ["--input", "gs://bucket/raw_data.csv",
                 "--output", "{{$.outputs.artifacts.preprocessed.path}}"]
        outputs:
          artifacts:
            - name: preprocessed
              path: /data/preprocessed.csv
      - name: train-model
        container:
          image: my-trainer:latest
          command: ["python", "train.py"]
          args: ["--data", "{{$.inputs.artifacts.preprocessed.path}}",
                 "--model", "{{$.outputs.artifacts.model.path}}"]
        inputs:
          artifacts:
            - name: preprocessed
              task: preprocess-data
              path: /data/input.csv
        outputs:
          artifacts:
            - name: model
              path: /model/model.pkl
      - name: evaluate-model
        container:
          image: my-evaluator:latest
          command: ["python", "evaluate.py"]
          args: ["--model", "{{$.inputs.artifacts.model.path}}",
                 "--metrics", "{{$.outputs.artifacts.metrics.path}}"]
        inputs:
          artifacts:
            - name: model
              task: train-model
              path: /model/input.pkl
        outputs:
          artifacts:
            - name: metrics
              path: /metrics/metrics.json
```
Kubeflow Python SDK

```python
from kfp import dsl
from kfp.components import create_component_from_func, InputPath, OutputPath


@create_component_from_func
def preprocess_op(input_path: str, output_csv: OutputPath(str)):
    import pandas as pd
    df = pd.read_csv(input_path)
    df_clean = df.fillna(0)  # impute missing values
    df_clean.to_csv(output_csv, index=False)


@create_component_from_func
def train_op(data_csv: InputPath(str), model_out: OutputPath(str)):
    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    df = pd.read_csv(data_csv)
    X = df.drop('target', axis=1)
    y = df['target']
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)
    joblib.dump(model, model_out)


@create_component_from_func
def evaluate_op(model_in: InputPath(str), data_csv: InputPath(str)) -> str:
    import json
    import joblib
    import pandas as pd
    from sklearn.metrics import accuracy_score
    model = joblib.load(model_in)
    df = pd.read_csv(data_csv)
    X = df.drop('target', axis=1)
    y = df['target']
    accuracy = accuracy_score(y, model.predict(X))
    return json.dumps({"accuracy": accuracy})


@dsl.pipeline(name="training-pipeline")
def pipeline(data_path: str):
    preprocess = preprocess_op(input_path=data_path)
    train = train_op(data_csv=preprocess.outputs["output_csv"])
    evaluate = evaluate_op(
        model_in=train.outputs["model_out"],
        data_csv=preprocess.outputs["output_csv"],
    )
```
What is Weights & Biases?
Weights & Biases (W&B) is a cloud-first experiment tracking platform designed for speed and simplicity. It focuses on:
- Experiment Tracking: Beautiful, real-time visualizations
- Sweeps: Automated hyperparameter optimization
- Reports: Collaboration and documentation
- Artifacts: Model and dataset versioning
W&B Best Practices

```python
import torch
import wandb

# Initialize W&B
wandb.init(
    project="credit-scoring",
    entity="my-team",
    name="xgboost-exp-001",
    tags=["production", "xgboost", "v2"],
    notes="Testing new feature engineering approach"
)

# Log model gradients for debugging (call once, before the training loop)
wandb.watch(model, log="all")

# Log training metrics
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader)
    val_loss, val_acc = evaluate(model, val_loader)

    # Log metrics to W&B
    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_loss": val_loss,
        "val_accuracy": val_acc,
        "learning_rate": scheduler.get_last_lr()[0]
    })

# Save model checkpoint
torch.save(model.state_dict(), "model.pt")
wandb.save("model.pt")
```
W&B Sweeps for Hyperparameter Tuning

```yaml
# sweep.yaml
program: train.py
method: bayes
metric:
  name: val_accuracy
  goal: maximize
parameters:
  learning_rate:
    # log_uniform_values takes actual values; log_uniform expects exponents
    distribution: log_uniform_values
    min: 0.0001
    max: 0.1
  batch_size:
    values: [16, 32, 64, 128]
  max_depth:
    distribution: int_uniform
    min: 3
    max: 10
  n_estimators:
    values: [100, 200, 500]
  min_child_weight:
    distribution: log_uniform_values
    min: 1
    max: 10
```
```python
# train.py with sweep
import wandb
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

wandb.init()
config = wandb.config

model = XGBClassifier(
    learning_rate=config.learning_rate,
    max_depth=config.max_depth,
    n_estimators=config.n_estimators,
    min_child_weight=config.min_child_weight
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
wandb.log({"val_accuracy": accuracy})
```
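Sweeps can also be created and launched programmatically instead of from the YAML file. A sketch assuming a `train()` function like the script above (the `count` value is arbitrary):

```python
# Sweep configuration as a Python dict, mirroring sweep.yaml
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values",
                          "min": 0.0001, "max": 0.1},
        "batch_size": {"values": [16, 32, 64, 128]},
    },
}

# With wandb installed and a train() function defined:
# sweep_id = wandb.sweep(sweep_config, project="credit-scoring")
# wandb.agent(sweep_id, function=train, count=20)
```

The agent pulls hyperparameter combinations from the W&B server and calls `train()` once per trial, so the training code stays identical to the CLI-driven version.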
W&B Artifacts

```python
import wandb

# Create a dataset artifact
with wandb.init(entity="my-team", project="ml-pipeline", job_type="process-data") as run:
    processed_data = wandb.Artifact(
        name="processed-dataset",
        type="dataset",
        description="Cleaned and preprocessed customer data"
    )
    processed_data.add_file("data/processed.csv")
    run.log_artifact(processed_data)

# Use the dataset artifact in training
with wandb.init(entity="my-team", project="ml-pipeline", job_type="train") as run:
    artifact = run.use_artifact("processed-dataset:v0")
    artifact.download()

    # Train model
    model = train_model("data/processed.csv")

    # Log the model as an artifact
    model_artifact = wandb.Artifact(name="trained-model", type="model")
    model_artifact.add_file("model.pkl")
    run.log_artifact(model_artifact)
```
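Beyond pinned versions like `:v0`, W&B artifacts support aliases (`latest` by default, or custom ones attached at log time via `run.log_artifact(artifact, aliases=[...])`). A small sketch of building such references (the `artifact_ref` helper is ours, not part of the wandb API):

```python
def artifact_ref(name: str, alias: str = "latest") -> str:
    # W&B resolves "name:alias" references such as "trained-model:production".
    return f"{name}:{alias}"

production_ref = artifact_ref("trained-model", "production")
# Inside a run: artifact = run.use_artifact(production_ref)
```

Consuming by alias rather than a fixed version lets you promote a new model by moving the alias, without touching downstream code.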
Feature Comparison
| Feature | MLflow | Kubeflow | W&B |
|---|---|---|---|
| Deployment | Self-hosted, cloud | Kubernetes-native | Cloud-first, self-hosted option |
| Experiment Tracking | Good | Basic | Excellent |
| Pipeline Orchestration | Limited | Excellent | Limited |
| Model Serving | Built-in | KServe | Via integration |
| Hyperparameter Tuning | Basic | Katib | Sweeps |
| Learning Curve | Low | High | Low |
| Cost | Free (open-source) | Infrastructure costs | Free tier + paid plans |
| Integration | Broad (scikit-learn, PyTorch, TF) | Kubernetes ecosystem | Most ML frameworks |
When to Use Each Platform
Use MLflow When:
- You need a lightweight, flexible experiment tracker
- You’re already using Databricks or have existing ML infrastructure
- You need model registry with staging lifecycle
- You want to package models for portable deployment
```python
# Good: MLflow for quick experiment tracking
import mlflow

mlflow.set_experiment("quick-experiments")
with mlflow.start_run():
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_metric("accuracy", 0.87)
    mlflow.sklearn.log_model(model, "model")
```
Use Kubeflow When:
- You’re running ML on Kubernetes at scale
- You need complex pipeline orchestration
- Your team has Kubernetes expertise
- You need enterprise-grade multi-tenancy
```yaml
# Good: Kubeflow for complex, scalable pipelines
- name: distributed-training
  template:
    dag:
      tasks:
        - name: preprocess
          template: preprocess-container
        - name: train-distributed
          template: pytorch-job
          dependencies: [preprocess]
        - name: evaluate
          template: evaluate-container
          dependencies: [train-distributed]
```
Use Weights & Biases When:
- You prioritize visualization and collaboration
- You need rapid experiment iteration
- Your team values beautiful dashboards
- You want minimal infrastructure overhead
```python
# Good: W&B for team collaboration
wandb.init(
    project="team-experiments",
    entity="research-group",
    notes="Reproducing paper results"
)

# Team can see real-time results in the W&B dashboard
wandb.log({"accuracy": 0.92, "loss": 0.08})
```
Bad Practices to Avoid
Bad Practice 1: Using Multiple Tracking Servers Without Organization
```python
# Bad: Scattered experiments across different servers
mlflow.set_tracking_uri("http://localhost:5000")     # Local
# Later in code...
mlflow.set_tracking_uri("http://cloud-mlflow:5000")  # Cloud
# Result: impossible to compare experiments
```
Bad Practice 2: Not Using Model Registry
```python
# Bad: Manual model versioning
import joblib

joblib.dump(model, "model_v1.pkl")  # Where is v1? What changed?
# Later...
joblib.dump(model, "model_v2.pkl")  # Confusion about lineage
```
Bad Practice 3: Ignoring Artifact Storage
```python
# Bad: Not logging artifacts, losing important outputs
mlflow.log_metric("accuracy", 0.95)
# Missing: model, feature importance plots, datasets
# Result: can't reproduce or audit experiments
```
Good Practices Summary
MLflow Best Practices
- Use a consistent experiment naming convention
- Log all parameters, metrics, and artifacts
- Use Model Registry for production models
- Set up proper artifact storage (S3, GCS, Azure Blob)
```python
# Good: Comprehensive experiment logging
import mlflow

mlflow.set_experiment("production-models")

with mlflow.start_run(run_name="production-v2.1") as run:
    # Log everything
    mlflow.log_params(config.to_dict())
    mlflow.log_metrics(metrics)
    mlflow.log_dict(config.to_dict(), "config.yaml")
    mlflow.log_artifact("feature_importance.png")

    # Log the model
    mlflow.sklearn.log_model(model, "model")

# Register the logged model in the Model Registry
mlflow.register_model(f"runs:/{run.info.run_id}/model", "production-model")
```
Kubeflow Best Practices
- Containerize your training code
- Use a managed offering (e.g., Vertex AI Pipelines) if you don't want to operate Kubeflow yourself
- Implement proper resource limits
- Use KFP SDK for pipeline definition
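On the resource-limit point: KFP tasks ultimately carry a standard Kubernetes `resources` block. A sketch of what that block contains (the `resource_limits` helper is ours; in the KFP v1 SDK the equivalent is `task.set_cpu_limit("2").set_memory_limit("4Gi")` on a pipeline task):

```python
def resource_limits(cpu: str, memory: str) -> dict:
    # Mirrors the Kubernetes container "resources" stanza that KFP's
    # set_cpu_limit / set_memory_limit helpers populate.
    return {"limits": {"cpu": cpu, "memory": memory}}

train_resources = resource_limits("2", "4Gi")
```

Setting explicit limits keeps a runaway training step from starving other workloads on the cluster.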
W&B Best Practices
- Use named runs with clear conventions
- Leverage sweeps for hyperparameter search
- Create W&B Reports for documentation
- Use Artifacts for datasets and models
```python
# Good: W&B with proper organization
wandb.init(
    project="production-ml",
    name=f"experiment-{config.experiment_id}",
    tags=["production", config.model_type],
    config=config.to_dict()  # Log config as part of init
)

# Use config values throughout
model = Model(learning_rate=config.learning_rate)

# Log everything automatically
wandb.watch(model, log_freq=100)
wandb.log(metrics)
```
External Resources
- MLflow Documentation
- Kubeflow Documentation
- Weights & Biases Documentation
- MLflow Model Registry Guide
- Kubeflow Pipelines SDK
- W&B Sweeps Documentation
- MLflow vs Kubeflow vs W&B - Comparison Article
- Kubeflow on AWS
- MLflow on Databricks