Introduction
Platform engineering has emerged as a critical discipline for organizations seeking to scale their development teams while maintaining velocity and reliability. By building internal platforms that provide self-service capabilities, organizations can reduce the friction between development and operations teams, accelerate developer productivity, and enforce organizational standards automatically.
This guide covers platform engineering from the fundamental concepts through to building production-ready internal developer platforms. We’ll explore architectural patterns, implementation strategies, and worked examples that you can adapt to your organization’s needs.
Whether you’re just starting to explore platform engineering or looking to improve your existing platform, this guide provides the knowledge and practical patterns you need to succeed.
Understanding Platform Engineering
The Evolution from DevOps to Platform Engineering
Traditional DevOps aimed to break down silos between development and operations teams. While successful, it often created a new bottleneck: operations teams still had to handle numerous requests for infrastructure, configurations, and access. Platform engineering evolves this model by creating self-service capabilities that empower developers while maintaining governance.
Platform Engineering Evolution: Traditional (Siloed) --> DevOps (Teams) --> Platform Engineering (Self-Service)
| Traditional (Siloed) | DevOps (Teams) | Platform Engineering (Self-Service) |
|---|---|---|
| Manual ops | Shared ownership | Self-service platform |
| Slow changes | Automation | Golden paths |
| Knowledge silos | Collaboration | Developer experience |
Core Concepts
| Concept | Description |
|---|---|
| Internal Developer Platform (IDP) | Internal product that provides self-service capabilities |
| Self-Service | Developers can provision resources without manual intervention |
| Golden Paths | Opinionated, supported, and recommended approaches |
| Paved Roads | Standardized, automated workflows that are safe to follow |
| Developer Experience (DevX) | The overall experience developers have doing their work |
| Platform Team | Team responsible for building and maintaining the platform |
Platform Team Responsibilities
The platform team acts as an internal product team:
# Platform team responsibilities
responsibilities = {
    "product_management": "Define and prioritize platform capabilities",
    "platform_development": "Build and maintain platform components",
    "developer_relations": "Support and train platform users",
    "incident_response": "Handle platform-related incidents",
    "governance": "Ensure standards and compliance",
    "documentation": "Maintain up-to-date platform docs",
}
Building Your Internal Developer Platform
Assessment Phase
Before building, understand your current state:
# Developer Experience Assessment
assessment_questions = {
    "time_questions": [
        "How long does it take to provision a new environment?",
        "How long does it take to deploy a simple change to production?",
        "How long does it take to get access to required resources?",
    ],
    "friction_questions": [
        "What manual processes cause delays?",
        "What tools cause the most frustration?",
        "What requests does the ops team receive most often?",
    ],
    "value_questions": [
        "What would developers self-serve if they could?",
        "What would have the biggest impact on productivity?",
        "What standards are most important to enforce?",
    ],
}
# Record current-state metrics as a baseline (example values)
current_state = {
    "avg_time_to_provision": "2 weeks",
    "deployment_frequency": "monthly",
    "self_service_percentage": "20%",
    "developer_satisfaction": "6/10",
    "incident_volume": "50/month",
}
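Baseline numbers like the ones above are easier to defend when they are derived from request history rather than estimated. The following is a minimal sketch of that calculation; the `ProvisioningRequest` record shape and the sample history are assumptions made for illustration, not data from a real ticketing system.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ProvisioningRequest:
    """One historical infrastructure request (hypothetical record shape)."""
    days_to_fulfil: float
    self_service: bool  # True if no ops ticket was needed


def baseline(requests: list[ProvisioningRequest]) -> dict:
    """Summarize request history into baseline numbers."""
    if not requests:
        raise ValueError("need at least one request to compute a baseline")
    self_served = sum(1 for r in requests if r.self_service)
    return {
        "avg_days_to_provision": round(mean(r.days_to_fulfil for r in requests), 1),
        "self_service_percentage": round(100 * self_served / len(requests), 1),
    }


# Illustrative history: four ticket-driven requests and one self-service one
history = [
    ProvisioningRequest(days_to_fulfil=14, self_service=False),
    ProvisioningRequest(days_to_fulfil=10, self_service=False),
    ProvisioningRequest(days_to_fulfil=21, self_service=False),
    ProvisioningRequest(days_to_fulfil=0.5, self_service=True),
    ProvisioningRequest(days_to_fulfil=17, self_service=False),
]
print(baseline(history))
```

Re-running the same calculation after each platform release gives a like-for-like comparison against the baseline.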
Platform Architecture
Internal Developer Platform (each layer calls the layer below it)
| Layer | Components |
|---|---|
| Developer Portal (Backstage) | Service Catalog, Resource Provisioning, Documentation & Guides |
| Platform APIs & Services | Provisioning Service, Pipeline Service, Observability Service |
| Infrastructure Layer | Kubernetes Cluster, Cloud Provider, Databases & Services |
Self-Service Infrastructure
Database Provisioning
# Database provisioning service
import secrets
from dataclasses import dataclass


@dataclass
class DatabaseSpec:
    """Database specification for self-service provisioning."""
    name: str
    type: str  # postgres, mysql, mongodb, redis
    size: str  # small, medium, large, xlarge
    backup_enabled: bool = True
    backup_retention_days: int = 30
    high_availability: bool = False
    encrypted: bool = True


@dataclass
class Credentials:
    """Generated database login."""
    username: str
    password: str


class DatabaseProvisioner:
    """Self-service database provisioning."""

    # URL scheme per engine, so non-Postgres databases get a valid connection string
    URL_SCHEMES = {"postgres": "postgresql", "mysql": "mysql",
                   "mongodb": "mongodb", "redis": "redis"}

    def __init__(self, cloud_client, secrets_manager):
        # Both clients are injected; the async methods called on them below
        # (create_database, enable_backup, create) are assumed interfaces.
        self.cloud = cloud_client
        self.secrets = secrets_manager

    async def provision(self, spec: DatabaseSpec, owner: str) -> dict:
        """Provision a database based on spec."""
        # Create database instance
        db = await self.cloud.create_database(
            instance_type=self._get_instance_type(spec.size),
            storage=self._get_storage_size(spec.size),
            engine=spec.type,
            high_availability=spec.high_availability,
            encrypted=spec.encrypted,
        )
        # Configure backup
        if spec.backup_enabled:
            await self.cloud.enable_backup(
                db.id,
                retention_days=spec.backup_retention_days,
            )
        # Generate credentials
        credentials = self._generate_credentials()
        scheme = self.URL_SCHEMES.get(spec.type, spec.type)
        # Store credentials and connection string in the secrets manager
        secret_name = f"db/{spec.name}"
        await self.secrets.create(
            secret_name,
            {
                "host": db.endpoint,
                "port": db.port,
                "username": credentials.username,
                "password": credentials.password,
                "database": spec.name,
                "connection_string": (
                    f"{scheme}://{credentials.username}:{credentials.password}"
                    f"@{db.endpoint}:{db.port}/{spec.name}"
                ),
            },
            owner=owner,
        )
        # Return a reference to the secret, not the password itself, so
        # credentials never appear in API responses or logs.
        return {
            "id": db.id,
            "endpoint": db.endpoint,
            "status": "ready",
            "secret_name": secret_name,
        }

    def _get_instance_type(self, size: str) -> str:
        sizes = {
            "small": "db.t3.micro",
            "medium": "db.t3.medium",
            "large": "db.r5.large",
            "xlarge": "db.r5.xlarge",
        }
        return sizes.get(size, "db.t3.micro")

    def _get_storage_size(self, size: str) -> int:
        # Storage in GiB per size tier
        sizes = {"small": 20, "medium": 50, "large": 100, "xlarge": 500}
        return sizes.get(size, 20)

    def _generate_credentials(self) -> Credentials:
        # Generate secure random credentials
        return Credentials(
            username=f"app_{secrets.token_hex(4)}",
            password=secrets.token_urlsafe(16),
        )
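The provisioner above silently falls back to the smallest instance when it receives an unknown size, so a typo such as "meduim" would produce an undersized database rather than an error. One way to avoid that is to validate the request before anything is created. The sketch below assumes the allowed values listed in the `DatabaseSpec` comments; the function name `validate_spec` and the naming rule are hypothetical.

```python
ALLOWED_TYPES = {"postgres", "mysql", "mongodb", "redis"}
ALLOWED_SIZES = {"small", "medium", "large", "xlarge"}


def validate_spec(name: str, db_type: str, size: str,
                  backup_retention_days: int = 30) -> list[str]:
    """Return a list of problems; an empty list means the request is acceptable."""
    problems = []
    # Assumed naming rule: letters, digits, '-' and '_' only
    if not name or not name.replace("-", "").replace("_", "").isalnum():
        problems.append(f"invalid name: {name!r}")
    if db_type not in ALLOWED_TYPES:
        problems.append(f"unsupported type: {db_type!r}")
    if size not in ALLOWED_SIZES:
        problems.append(f"unsupported size: {size!r}")
    if backup_retention_days < 1:
        problems.append("backup_retention_days must be at least 1")
    return problems


print(validate_spec("orders", "postgres", "medium"))
print(validate_spec("orders", "oracle", "meduim"))
```

Returning every problem at once, rather than raising on the first, lets a portal form show all errors in a single round trip.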
Environment Management
# Environment custom resource (an instance of a platform-defined CRD)
apiVersion: platform.example.com/v1
kind: Environment
metadata:
  name: myapp-staging
  namespace: platform
spec:
  type: environment
  owner: team-alpha
  cluster: staging-cluster
  namespace: myapp-staging
  services:
    - name: api
      replicas: 2
      resources:
        cpu: "500m"
        memory: "512Mi"
    - name: worker
      replicas: 1
      resources:
        cpu: "1000m"
        memory: "1Gi"
  databases:
    - name: app-db
      type: postgres
      size: medium
  secrets:
    - name: api-keys
      source: vault-secrets
  networking:
    ingress:
      enabled: true
      hostname: myapp.staging.example.com
    egress:
      allowed: all
  monitoring:
    enabled: true
    alertChannel: slack-staging
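Before an environment like this is admitted, a platform controller typically needs to know how much capacity it requests in total. The sketch below sums CPU and memory across the services in the manifest. It handles only the quantity forms used above (millicores such as "500m", whole cores, and the binary suffixes Ki, Mi, Gi); the function names are hypothetical.

```python
def parse_cpu(value: str) -> float:
    """Convert a Kubernetes CPU quantity ("500m" or "2") to cores."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)


_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(value: str) -> int:
    """Convert a binary-suffixed memory quantity ("512Mi", "1Gi") to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # no suffix: plain bytes


def total_requests(services: list[dict]) -> dict:
    """Sum per-replica requests across all services in an environment."""
    cpu = sum(s["replicas"] * parse_cpu(s["resources"]["cpu"]) for s in services)
    mem = sum(s["replicas"] * parse_memory(s["resources"]["memory"]) for s in services)
    return {"cpu_cores": cpu, "memory_mib": mem // 1024**2}


# The services section of the manifest above, as parsed data
services = [
    {"name": "api", "replicas": 2, "resources": {"cpu": "500m", "memory": "512Mi"}},
    {"name": "worker", "replicas": 1, "resources": {"cpu": "1000m", "memory": "1Gi"}},
]
print(total_requests(services))  # {'cpu_cores': 2.0, 'memory_mib': 2048}
```

Comparing these totals against a per-team quota is one simple admission check a platform could run before provisioning.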
Golden Paths
Golden paths are opinionated, supported approaches that make it easy to do the right thing:
# Golden path: Standard Service
apiVersion: platform.example.com/v1
kind: GoldenPath
metadata:
  name: standard-service
spec:
  description: Approved path for production services
  # Required components
  required:
    - type: github-repository
      ci: github-actions
      default-branch: main
      branch-protection: true
      required-checks:
        - lint
        - test
        - security-scan
    - type: kubernetes
      deployment:
        replicas: 2
        strategy: rolling
        health-checks: true
    - type: service-mesh
      mtls: required
      observability: enabled
    - type: logging
      retention-days: 30
      sampling-rate: 100
    - type: monitoring
      metrics: enabled
      alerts:
        - error-rate
        - latency-p99
        - availability
  # Optional components
  optional:
    - type: database
      types: [postgres, mysql, redis]
      backup: required
    - type: cache
      types: [redis, memcached]
    - type: message-queue
      types: [rabbitmq, kafka]
  # Standards enforced
  standards:
    - programming-language: [go, python, typescript, java]
    - testing: 80% minimum coverage
    - security: SAST + dependency scan
    - license: approved licenses only
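A golden path is only useful if conformance can be checked automatically. The following sketch evaluates a service description against a few of the requirements listed above (required CI checks, minimum replicas, mTLS, and 80% test coverage). The shape of the `service` dictionary and the function name are assumptions for illustration; a real platform would gather these facts from CI and the cluster.

```python
REQUIRED_CHECKS = {"lint", "test", "security-scan"}
MIN_COVERAGE = 80.0
MIN_REPLICAS = 2


def golden_path_violations(service: dict) -> list[str]:
    """Return the golden-path rules a service breaks (empty list = compliant)."""
    violations = []
    missing = REQUIRED_CHECKS - set(service.get("ci_checks", []))
    if missing:
        violations.append(f"missing CI checks: {', '.join(sorted(missing))}")
    if service.get("coverage", 0.0) < MIN_COVERAGE:
        violations.append(f"coverage below {MIN_COVERAGE:.0f}%")
    if service.get("replicas", 0) < MIN_REPLICAS:
        violations.append(f"fewer than {MIN_REPLICAS} replicas")
    if not service.get("mtls", False):
        violations.append("mTLS not enabled")
    return violations


compliant = {"ci_checks": ["lint", "test", "security-scan"],
             "coverage": 91.5, "replicas": 3, "mtls": True}
drifting = {"ci_checks": ["lint", "test"],
            "coverage": 62.0, "replicas": 1, "mtls": True}

print(golden_path_violations(compliant))
print(golden_path_violations(drifting))
```

Reporting violations as a list, rather than a pass/fail flag, makes it possible to show teams exactly what separates them from the paved road.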
Developer Portal Implementation
Backstage Setup
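// Backstage app configuration (legacy frontend system)
import { createApp } from '@backstage/app-defaults';
import { catalogPlugin } from '@backstage/plugin-catalog';
import { scaffolderPlugin } from '@backstage/plugin-scaffolder';
import { kubernetesPlugin } from '@backstage/plugin-kubernetes';
// Argo CD support comes from community-maintained plugins rather than an
// @backstage/plugin-argocd package; install one and register it per its docs.

const app = createApp({
  // Plugins are passed as instances, not invoked as functions
  plugins: [catalogPlugin, scaffolderPlugin, kubernetesPlugin],
  // Custom colors (for example primary '#0066cc', secondary '#0099ff') are
  // supplied through the `themes` option as AppTheme objects; the default
  // light and dark themes are used when it is omitted.
});

export default app;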
Custom Scaffolder Actions
// Custom scaffolder actions for the platform
import { Config } from '@backstage/config';
import { ScmIntegrations } from '@backstage/integration';
import { createDatabaseAction } from './actions/database';
import { createNamespaceAction } from './actions/namespace';
import { setupGitHubActionsAction } from './actions/github-actions';

// Build the action list from the backend's root config
export const createPlatformActions = (config: Config) => [
  createDatabaseAction({
    databaseHost: process.env.DATABASE_HOST,
    databasePort: process.env.DATABASE_PORT,
  }),
  createNamespaceAction({
    kubernetesCluster: process.env.K8S_CLUSTER,
  }),
  setupGitHubActionsAction({
    integrations: ScmIntegrations.fromConfig(config),
  }),
];
// Using the actions in a template
# Template: New Service
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: new-service
  title: Create New Service
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service Information
      required:
        - serviceName
        - description
        - team
        - repoUrl
      properties:
        serviceName:
          type: string
          title: Service Name
        description:
          type: string
          title: Description
        team:
          type: string
          title: Owner Team
        # Assumed extra parameter: publish:github needs a target repository
        repoUrl:
          type: string
          title: Repository Location
          ui:field: RepoUrlPicker
  steps:
    - id: fetch-template
      name: Fetch Template
      action: fetch:template
      input:
        url: https://github.com/example/service-template
        values:
          serviceName: ${{ parameters.serviceName }}
    - id: create-namespace
      name: Create Kubernetes Namespace
      action: platform:kubernetes:namespace:create
      input:
        name: ${{ parameters.serviceName }}
        team: ${{ parameters.team }}
    - id: provision-database
      name: Provision Database
      action: platform:database:create
      input:
        name: ${{ parameters.serviceName }}
        type: postgres
        size: small
        owner: ${{ parameters.team }}
    # Publish before registering: fetch:template produces no outputs, while
    # publish:github returns the repository URLs the catalog step needs.
    - id: publish
      name: Publish
      action: publish:github
      input:
        repoUrl: ${{ parameters.repoUrl }}
        description: ${{ parameters.description }}
    - id: register-catalog
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
        # Assumes the service template ships a catalog-info.yaml at its root
        catalogInfoPath: /catalog-info.yaml
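The template passes `serviceName` straight through as the Kubernetes namespace name. Namespace names must be valid DNS labels (lowercase alphanumerics and `-`, starting and ending with an alphanumeric, at most 63 characters), so free-form input such as "Payment Service" would fail at the namespace step. A custom action could normalize or reject the name first. The helpers below are a hypothetical sketch of that check, not part of Backstage.

```python
import re

# DNS label rule used for Kubernetes namespace names
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
_MAX_LENGTH = 63


def is_valid_namespace(name: str) -> bool:
    """True if `name` can be used as a Kubernetes namespace name."""
    return len(name) <= _MAX_LENGTH and _DNS_LABEL.fullmatch(name) is not None


def to_namespace(service_name: str) -> str:
    """Normalize a free-form service name into a valid namespace name."""
    slug = re.sub(r"[^a-z0-9]+", "-", service_name.lower()).strip("-")
    slug = slug[:_MAX_LENGTH].strip("-")  # truncation may expose a trailing '-'
    if not slug:
        raise ValueError(f"cannot derive a namespace from {service_name!r}")
    return slug


print(is_valid_namespace("payment-service"))
print(is_valid_namespace("Payment_Service"))
print(to_namespace("Payment Service_v2"))
```

Rejecting invalid names in the form (with a JSON Schema `pattern` on the parameter) gives faster feedback than failing mid-run; normalizing in the action is the more forgiving alternative.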
Service Catalog
// Service entity definition
import { Entity } from '@backstage/catalog-model';

const myServiceEntity: Entity = {
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'Component',
  metadata: {
    name: 'payment-service',
    description: 'Payment processing service',
    tags: ['java', 'spring-boot', 'payment'],
    annotations: {
      'github.com/project-slug': 'company/payment-service',
      'backstage.io/techdocs-ref': 'dir:.',
      'argocd/app-name': 'payment-service',
      'backstage.io/kubernetes-namespace': 'payment',
    },
  },
  spec: {
    type: 'service',
    lifecycle: 'production',
    owner: 'platform-team',
    system: 'payments',
    // The catalog derives the ownedBy, partOf, and dependsOn relations from
    // these spec fields during processing; they are not written by hand.
    dependsOn: ['resource:default/payment-database'],
  },
};
Platform as Code
GitOps for Platform
# Platform configuration in Git
apiVersion: platform.example.com/v1
kind: PlatformConfiguration
metadata:
  name: company-platform
  namespace: platform
spec:
  # Cluster configuration
  clusters:
    - name: production
      region: us-east-1
      nodePools:
        - name: general
          minSize: 3
          maxSize: 10
          instanceType: m5.xlarge
        - name: memory
          minSize: 2
          maxSize: 5
          instanceType: r5.xlarge
    - name: staging
      region: us-east-1
      nodePools:
        - name: general
          minSize: 2
          maxSize: 5
          instanceType: m5.large
  # Service mesh configuration
  serviceMesh:
    provider: istio
    mtls: strict
    tracing:
      provider: jaeger
      samplingRate: 10
  # Observability
  observability:
    prometheus:
      retention: 30d
      storage: 200Gi
    logging:
      retention: 14d
      aggregation: elasticsearch
    alerts:
      channels:
        - type: slack
          webhookUrl: https://hooks.slack.com/...
        - type: pagerduty
          integrationKey: xxx
  # Security
  security:
    networkPolicies: true
    # Pod Security Standards level (PodSecurityPolicy was removed in Kubernetes 1.25)
    podSecurityStandard: restricted
    secretEncryption:
      provider: vault
    imageScanning: true
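The core of GitOps is a reconciliation loop: compare the desired state stored in Git with the state actually running, then act on the differences. The sketch below shows only the comparison step, on nested dictionaries such as a parsed version of the configuration above. The function name and the `actual` sample are assumptions; real GitOps tools perform this diff against live cluster objects.

```python
def diff_config(desired: dict, actual: dict, path: str = "") -> list[str]:
    """Recursively list the changes needed to move `actual` to `desired`."""
    changes = []
    for key in sorted(desired.keys() | actual.keys()):
        here = f"{path}.{key}" if path else key
        if key not in actual:
            changes.append(f"add {here}")
        elif key not in desired:
            changes.append(f"remove {here}")
        elif isinstance(desired[key], dict) and isinstance(actual[key], dict):
            # Descend into nested sections and keep the dotted path
            changes.extend(diff_config(desired[key], actual[key], here))
        elif desired[key] != actual[key]:
            changes.append(f"update {here}: {actual[key]!r} -> {desired[key]!r}")
    return changes


# Desired state (from Git) versus a hypothetical drifted live state
desired = {
    "serviceMesh": {"provider": "istio", "mtls": "strict"},
    "observability": {"prometheus": {"retention": "30d"}},
}
actual = {
    "serviceMesh": {"provider": "istio", "mtls": "permissive"},
    "observability": {"prometheus": {"retention": "30d"}},
    "debug": True,
}
for change in diff_config(desired, actual):
    print(change)
```

An empty result means the live state matches Git; anything else is either applied automatically or surfaced as drift, depending on policy.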
Terraform for Platform
# Platform infrastructure as code
module "platform_cluster" {
  source = "./modules/eks"

  cluster_name    = "platform-eks"
  cluster_version = "1.28"
  region          = "us-east-1"

  node_groups = {
    general = {
      min_size       = 3
      max_size       = 10
      desired_size   = 3
      instance_types = ["m5.xlarge"]
    }
    memory = {
      min_size       = 2
      max_size       = 5
      desired_size   = 2
      instance_types = ["r5.xlarge"]
    }
  }

  # Enable platform services
  add_ons = {
    vpc-cni           = true
    coredns           = true
    kube-proxy        = true
    aws-load-balancer = true
    metrics-server    = true
    secrets-store-csi = true
  }
}

module "platform_database" {
  source = "./modules/rds"

  identifier              = "platform-db"
  engine                  = "postgres"
  engine_version          = "15.4"
  instance_class          = "db.r6g.large"
  allocated_storage       = 100
  backup_retention_period = 30
  deletion_protection     = true

  tags = {
    Platform  = "true"
    ManagedBy = "terraform"
  }
}
Metrics and Feedback
Platform Metrics
// Key platform metrics
const platformMetrics = {
  // Adoption metrics
  adoption: {
    dailyActiveUsers: "DAU of platform",
    weeklyActiveUsers: "WAU of platform",
    servicesOnPlatform: "Total services using platform",
    selfServiceRate: "% of resources provisioned self-service",
  },
  // Efficiency metrics
  efficiency: {
    timeToProvision: "Average time to provision resources",
    timeToDeploy: "Average time from commit to production",
    deploymentFrequency: "Deployments per day",
    leadTime: "Time from request to delivery",
  },
  // Reliability metrics
  reliability: {
    platformUptime: "Platform availability",
    incidentRate: "Incidents per month",
    mttr: "Mean time to recovery",
    supportTickets: "Platform-related tickets",
  },
  // Satisfaction metrics
  satisfaction: {
    developerNPS: "Net Promoter Score",
    csat: "Customer Satisfaction Score",
    feedbackCount: "Feedback submissions per quarter",
  },
};

// Example queries. Metric and label names are illustrative. The first query
// is LogQL (Loki) because it reads logs; the others are PromQL.
const queries = {
  // Adoption: distinct users with at least one logged action in the last 24h
  dailyActiveUsers: `count(sum by (user) (count_over_time({app="platform"} |= "action" | json [24h])))`,
  // Efficiency: assumes a gauge holding the latest provision duration
  avgProvisionTime: `avg(platform_provision_duration_seconds)`,
  // Reliability: request success ratio, used as a proxy for uptime
  platformUptime: `1 - (sum(rate(platform_errors_total[5m])) / sum(rate(platform_requests_total[5m])))`,
  // Satisfaction: assumes survey results are exported as a gauge
  developerNps: `avg(platform_developer_nps_score)`,
};
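Two of the reliability metrics reduce to simple arithmetic, which makes them easy to sanity-check outside the monitoring stack. The sketch below computes a request success ratio (the same formula as the `platformUptime` query) and mean time to recovery from incident open and resolve timestamps. The sample numbers and function names are made up for illustration.

```python
from datetime import datetime, timedelta


def availability(total_requests: int, failed_requests: int) -> float:
    """Success ratio: 1 - errors / requests (1.0 when there was no traffic)."""
    if total_requests == 0:
        return 1.0
    return 1 - failed_requests / total_requests


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to recovery over (opened, resolved) timestamp pairs."""
    if not incidents:
        return timedelta(0)
    total = sum((resolved - opened for opened, resolved in incidents), timedelta(0))
    return total / len(incidents)


# Illustrative incidents: one lasting 30 minutes, one lasting 90 minutes
incidents = [
    (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 30)),
    (datetime(2024, 3, 9, 14, 0), datetime(2024, 3, 9, 15, 30)),
]
print(f"availability: {availability(10_000, 25):.4f}")
print(f"MTTR: {mttr(incidents)}")
```

A request-based ratio treats every failed request equally, so it can look healthy during a short total outage at low traffic; time-based availability is a complementary view.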
Feedback Collection
# Feedback collection system
from dataclasses import dataclass
from typing import List
from datetime import datetime
@dataclass
class Feedback:
"""Platform feedback."""
user_id: str
feedback_type: str # suggestion, bug, compliment, complaint
category: str
description: str
rating: int # 1-5
created_at: datetime
class FeedbackCollector:
"""Collect and analyze platform feedback."""
def __init__(self, database):
self.db = database
async def submit_feedback(self, feedback: Feedback) -> None:
"""Submit user feedback."""
await self.db.feedback.insert(feedback)
async def get_nps_score(self) -> float:
"""Calculate Net Promoter Score."""
promoters = await self.db.feedback.count(
feedback_type='compliment',
rating__gte=9
)
detractors = await self.db.feedback.count(
feedback_type='complaint',
rating__lte=6
)
total = await self.db.feedback.count()
return ((promoters - detractors) / total * 100) if total > 0 else 0
async def get_top_issues(self, limit: int = 10) -> List[dict]:
"""Get most common issues."""
return await self.db.feedback.aggregate([
{'$match': {'feedback_type': 'complaint'}},
{'$group': {'_id': '$category', 'count': {'$sum': 1}}},
{'$sort': {'count': -1}},
{'$limit': limit}
])
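Net Promoter Score is defined on a 0-10 "how likely are you to recommend" scale: respondents scoring 9 or 10 are promoters, 0 to 6 are detractors, and 7 or 8 are passives who count only toward the total. A standalone sketch of the calculation, with made-up survey responses, makes the arithmetic concrete:

```python
def net_promoter_score(ratings: list[int]) -> float:
    """NPS = % promoters (9-10) minus % detractors (0-6), range -100 to 100."""
    if not ratings:
        return 0.0
    if any(not 0 <= r <= 10 for r in ratings):
        raise ValueError("NPS ratings must be between 0 and 10")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


# Illustrative survey: 4 promoters, 3 passives, 3 detractors
ratings = [10, 9, 9, 8, 7, 6, 3, 10, 8, 5]
print(net_promoter_score(ratings))  # 10.0
```

Because passives dilute the score without moving it in either direction, two teams with the same average rating can have very different NPS values.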
Best Practices
| Practice | Implementation |
|---|---|
| Start with developer needs | Survey before building |
| Build incrementally | Start with highest impact services |
| Treat platform as product | UX matters for internal tools |
| Document everything | Self-service requires good docs |
| Measure everything | Track adoption and satisfaction |
| Iterate based on feedback | Regular improvement cycles |
| Ensure security by default | Don’t make it opt-in |
| Automate compliance | Embed in platform, not process |
Conclusion
Platform engineering represents a mature approach to developer experience, combining the best of DevOps practices with product thinking. By building internal developer platforms that provide self-service capabilities, organizations can dramatically improve developer productivity while maintaining governance and security.
Key takeaways:
- Start with assessment - Understand current pain points before building
- Build incrementally - Start with high-impact, low-complexity services
- Focus on developer experience - The platform is a product
- Create golden paths - Make it easy to do the right thing
- Measure everything - Track adoption, efficiency, and satisfaction
- Iterate continuously - Platform engineering is never “done”
By following these patterns and principles, you’ll create a platform that developers love to use while enabling your organization to scale effectively.