Kubernetes Production Best Practices
I'll be honest: my first production Kubernetes deployment was a disaster. I thought I understood Kubernetes—I'd run through the tutorials, deployed a few test applications, read the documentation. How hard could it be?
Three days later, I was on a call at 2 AM trying to figure out why our application was randomly crashing, why pods were being evicted, and why our cluster was running out of resources. That experience taught me that running Kubernetes in production is fundamentally different from running it in development.
This guide is what I wish I'd known before that first production deployment. It's based on hard-won experience, late-night debugging sessions, and lessons learned from running Kubernetes clusters serving millions of requests daily.
The Kubernetes Reality Check
Kubernetes is powerful, but it's also complex. The abstraction it provides is valuable, but it comes with a cost: you need to understand how it works under the hood. When things go wrong in production, "it's Kubernetes's fault" isn't a valid excuse—you need to know how to debug and fix issues.
The good news? Once you understand Kubernetes and follow best practices, it's incredibly reliable. The bad news? Getting there requires learning a lot of concepts that aren't immediately obvious.
Security Hardening: Not Optional
Security in Kubernetes is multi-layered, and each layer matters. I've seen teams focus on one aspect (like network policies) while ignoring others (like RBAC), only to discover security vulnerabilities later.
Pod Security Standards: Start Here
Pod Security Standards (PSS) are Kubernetes's way of enforcing security policies at the namespace level. They're relatively new (the built-in PodSecurity admission controller went beta in Kubernetes 1.23 and stable in 1.25), but they're already essential for production clusters.
There are three policy levels: privileged, baseline, and restricted. For production workloads, you should use restricted:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
What PSS Actually Does
The restricted policy enforces:
- Containers must not run as root (runAsNonRoot: true)
- Containers must drop all capabilities (only NET_BIND_SERVICE may be added back)
- Pods must not use host namespaces (hostNetwork, hostPID, hostIPC)
- Volumes must not use host paths
- Containers must not allow privilege escalation and must set a seccomp profile
I've seen teams try to deploy applications that violate these policies, then wonder why they're being rejected. The solution isn't to lower the policy level—it's to fix your application to comply with the policy.
The Migration Path
If you're adding PSS to an existing cluster, don't start with enforce: restricted. Start with warn: restricted to see what would be blocked, fix your applications, then move to audit: restricted, and finally enforce: restricted. This gradual approach prevents breaking existing workloads.
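A namespace can mix modes during this migration, for example enforcing the weaker baseline policy while warning and auditing against restricted:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Enforce the weaker policy for now; surface restricted violations without blocking
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```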
Network Policies: Defense in Depth
Network policies are Kubernetes's built-in firewall. They control traffic between pods, and they're essential for multi-tenant clusters or applications with strict security requirements.
The Default Problem
By default, Kubernetes allows all traffic between pods. This is convenient for development but dangerous for production. I've seen teams deploy applications that were accessible from any pod in the cluster, only to discover this during a security audit.
Start with a default-deny policy:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {} # Selects all pods
  policyTypes:
  - Ingress
  - Egress
```
Then explicitly allow only the traffic you need:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-database
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
```
Network Policy Gotchas
Network policies are powerful but have limitations:
- They're enforced by the CNI plugin, so behavior varies between plugins
- They don't apply to pods using host networking
- Blocking traffic to the Kubernetes API server is unreliable and varies by plugin
- They're namespace-scoped, so cross-namespace policies require careful planning
I've seen teams write complex network policies only to discover their CNI plugin doesn't support all the features they're using. Test your network policies thoroughly with your specific CNI plugin.
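One gotcha deserves a concrete example: a default-deny egress policy also blocks DNS, so almost everything breaks until you explicitly allow it. A sketch of an allow-DNS policy (the kube-dns labels may differ on your cluster):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {} # all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```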
RBAC: Least Privilege in Practice
Role-Based Access Control (RBAC) is how you control who can do what in your cluster. It's essential for multi-user clusters, but it's also complex.
ServiceAccounts: Not Optional
Every pod runs with a ServiceAccount; if you don't specify one, it gets the namespace's default, and any permissions granted to that default apply to every pod in the namespace. Create dedicated ServiceAccounts with minimal permissions:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-service-account
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-role
  namespace: production
rules:
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-role-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: app-service-account
  namespace: production
roleRef:
  kind: Role
  name: app-role
  apiGroup: rbac.authorization.k8s.io
```
The Cluster-Admin Trap
I've seen teams give developers cluster-admin access "just to get things working." This is a security nightmare. Use namespaced roles whenever possible, and only grant cluster-wide permissions when absolutely necessary.
Regular Audits
RBAC configurations can become messy over time. I audit RBAC quarterly, looking for:
- Roles with excessive permissions
- ServiceAccounts not being used
- Users with unnecessary access
- Missing role bindings
Use tools like kubectl-who-can or rbac-lookup to understand who has access to what.
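Even without extra tooling, kubectl can answer most audit questions directly (the namespace and ServiceAccount names here are illustrative):

```shell
# What can this ServiceAccount do?
kubectl auth can-i --list \
  --as=system:serviceaccount:production:app-service-account

# Can it read secrets in the namespace?
kubectl auth can-i get secrets -n production \
  --as=system:serviceaccount:production:app-service-account
```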
Resource Management: The Foundation of Stability
Resource management is where many teams struggle. Kubernetes's resource model is powerful but requires understanding.
Resource Requests and Limits: Get Them Right
Resource requests and limits are how you tell Kubernetes what your application needs. Get them wrong, and you'll have problems:
Requests: What You Need
Requests are what Kubernetes uses for scheduling. A pod with requests.cpu: 500m needs half a CPU core available on a node before it can be scheduled. If no node has that capacity, the pod stays pending.
I've seen teams set requests too high, causing scheduling problems, or too low, causing performance issues. The key is to measure your application's actual resource usage and set requests based on that.
```yaml
resources:
  requests:
    memory: "256Mi" # Based on actual usage
    cpu: "250m"     # Based on actual usage
  limits:
    memory: "512Mi" # 2x requests is a good starting point
    cpu: "500m"     # 2x requests is a good starting point
```
Limits: What You Can Use
Limits are the maximum resources a pod can use. If a pod exceeds its memory limit, it gets killed (OOMKilled). If it exceeds its CPU limit, it gets throttled.
I've seen teams set limits equal to requests, which prevents pods from using additional resources when available. This is overly conservative. Set limits higher than requests to allow pods to burst when resources are available.
The Quality of Service Classes
Kubernetes assigns Quality of Service (QoS) classes based on requests and limits:
- Guaranteed: Both requests and limits are set and equal. These pods are last to be evicted.
- Burstable: Requests are set but limits are higher, or only requests are set. These pods can be evicted if the node runs out of resources.
- BestEffort: No requests or limits. These pods are first to be evicted.
For production workloads, use Guaranteed QoS for critical, latency-sensitive services. It provides the best protection against eviction, though it means setting limits equal to requests and giving up the burst headroom described above.
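A Guaranteed-class container simply sets limits equal to requests:

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi" # equal to requests -> Guaranteed QoS
    cpu: "500m"
```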
Horizontal Pod Autoscaling: It's Not Magic
Horizontal Pod Autoscaling (HPA) automatically scales the number of pods based on metrics. It's powerful, but it requires tuning:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0 # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
```
HPA Gotchas
I've seen teams set up HPA and then wonder why it's not working:
- HPA needs metrics. If metrics aren't available, HPA can't make decisions.
- HPA has a default sync period of 15 seconds. Changes don't happen instantly.
- HPA scales based on average metrics across all pods. If one pod is overloaded but others aren't, HPA might not scale.
- HPA needs headroom to scale. If all nodes are at capacity, HPA can't add more pods.
Custom Metrics
CPU and memory are good starting points, but custom metrics are often better. I've used custom metrics for:
- Request rate (requests per second)
- Queue depth (for message processing)
- Business metrics (orders per minute)
Custom metrics require a metrics adapter (like Prometheus Adapter) and more setup, but they provide better autoscaling decisions.
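With an adapter installed, the metrics section of an HPA spec can target a custom per-pod metric; a sketch, assuming you've exposed and registered a metric named http_requests_per_second:

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100" # scale so each pod handles ~100 req/s on average
```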
Vertical Pod Autoscaling: The Overlooked Feature
Vertical Pod Autoscaling (VPA) automatically adjusts resource requests and limits based on actual usage. It's less commonly used than HPA, but it's valuable for right-sizing workloads.
VPA can run in three modes:
- Off: Only provides recommendations
- Initial: Sets resources when pods are created
- Auto: Updates resources for running pods (requires pod recreation)
I use VPA in Off mode to get recommendations, then manually adjust resources. Auto mode is powerful but can cause pod churn as pods are recreated with new resource settings.
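A recommendation-only VPA looks like this (the VPA CRDs and controllers must be installed separately; names are illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off" # recommendations only, no automatic changes
```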
Monitoring and Observability: Know What's Happening
You can't operate Kubernetes effectively without good observability. I've seen teams deploy applications without monitoring, only to discover problems when users complain.
Prometheus: The Metrics Backbone
Prometheus is the de facto standard for Kubernetes metrics. It's powerful but requires understanding:
Scrape Configuration
Prometheus discovers targets to scrape using service discovery. For Kubernetes, this means configuring scrape configs:
```yaml
scrape_configs:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
```
The Cardinality Problem
I've seen Prometheus instances crash due to high cardinality metrics. This happens when you create unique metric labels for things like user IDs or request IDs. Each unique combination of label values creates a new time series, and Prometheus has limits.
Avoid high-cardinality labels. Use them for things like environment, service, or region, not for things like user IDs or request IDs.
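If an application already exposes a high-cardinality label you can't remove at the source, you can drop it at scrape time with metric_relabel_configs (the label name here is illustrative):

```yaml
metric_relabel_configs:
- action: labeldrop
  regex: user_id # drop the offending label from every scraped series
```

Note that dropping a label can merge previously distinct series into one, so this is a stopgap; the real fix is in the application's instrumentation.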
Recording Rules
Recording rules pre-compute expensive queries. Use them for:
- Aggregating metrics across pods
- Computing rates or increases
- Creating summary metrics
```yaml
groups:
- name: app_rules
  interval: 30s
  rules:
  - record: app:http_requests:rate5m
    expr: rate(http_requests_total[5m])
```
Logging: Centralized and Structured
Logging in Kubernetes is different from traditional applications. Pods are ephemeral, so logs need to be collected and stored externally.
Fluentd vs. Fluent Bit
Fluentd and Fluent Bit are both log collectors. Fluent Bit is lighter and faster, while Fluentd has more plugins. For Kubernetes, I prefer Fluent Bit—it's designed for containerized environments and has lower resource usage.
Structured Logging
Structured logs are essential for production. They're easier to parse, search, and analyze:
```json
{
  "timestamp": "2025-03-20T10:15:30Z",
  "level": "INFO",
  "service": "api",
  "pod": "api-7d4f8b9c6-abc123",
  "request_id": "req-12345",
  "method": "GET",
  "path": "/api/users",
  "status_code": 200,
  "duration_ms": 45,
  "user_id": "user-789"
}
```
Log Retention
Logs are expensive to store. I've seen teams store all logs forever, only to realize they're spending more on log storage than on compute. Set retention policies:
- Application logs: 7-30 days
- Audit logs: 90 days (for compliance)
- Security logs: 1 year (for compliance)
Distributed Tracing: Understanding Request Flow
Distributed tracing shows how requests flow through your system. It's essential for microservices architectures.
OpenTelemetry: The Standard
OpenTelemetry is becoming the standard for observability. It provides a single API for metrics, logs, and traces. I'm migrating all my applications to OpenTelemetry.
Sampling: The Performance Trade-off
Tracing every request can be expensive. Use sampling to reduce overhead:
- 100% sampling for errors
- 1-10% sampling for successful requests
- Adaptive sampling based on load
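In an OpenTelemetry Collector, this policy mix maps onto the tail_sampling processor; a minimal sketch (policy names are illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
    - name: keep-errors
      type: status_code
      status_code:
        status_codes: [ERROR] # keep 100% of errored traces
    - name: sample-rest
      type: probabilistic
      probabilistic:
        sampling_percentage: 5 # ~5% of everything else
```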
High Availability: Surviving Failures
High availability isn't just about running multiple replicas—it's about designing your system to survive failures gracefully.
Multi-Zone Deployment: Essential for Production
Deploying across multiple availability zones is essential for production. A single-zone deployment will fail when that zone has issues.
Pod Anti-Affinity
Use pod anti-affinity to ensure pods are distributed across zones:
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - my-app
        topologyKey: topology.kubernetes.io/zone
```
Zone-Aware Load Balancing
Ensure your load balancer distributes traffic across zones. I've seen load balancers that send all traffic to one zone, defeating the purpose of multi-zone deployment.
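Topology spread constraints are a newer, often simpler alternative to anti-affinity for spreading pods across zones:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway # use DoNotSchedule for a hard guarantee
  labelSelector:
    matchLabels:
      app: my-app
```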
StatefulSets: Handling State
StatefulSets are for applications that need stable network identities and persistent storage. They're more complex than Deployments but necessary for stateful workloads.
Headless Services
StatefulSets use headless services (clusterIP: None) to provide stable network identities. Each pod gets a predictable hostname: <statefulset-name>-<ordinal>.<service-name>.<namespace>.svc.cluster.local.
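A headless service is an ordinary Service with clusterIP set to None (names and port are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-statefulset-svc
  namespace: production
spec:
  clusterIP: None # headless: DNS resolves to individual pod addresses
  selector:
    app: my-app
  ports:
  - port: 5432
```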
Persistent Volumes
StatefulSets need persistent volumes. Use StatefulSet volume claims:
```yaml
volumeClaimTemplates:
- metadata:
    name: data
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: "fast-ssd"
    resources:
      requests:
        storage: 100Gi
```
Backup Strategies
Stateful applications need backups. I use:
- Volume snapshots for quick recovery
- Application-level backups for point-in-time recovery
- Cross-region replication for disaster recovery
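Volume snapshots are requested declaratively; this requires a CSI driver with snapshot support, and the class and PVC names below are illustrative:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snapshot-20250320
  namespace: production
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: data-my-app-0 # PVC created by the volumeClaimTemplate
```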
CI/CD Integration: Automate Everything
Kubernetes and CI/CD go hand in hand. Manual deployments don't scale.
GitOps: Infrastructure as Code
GitOps is the practice of managing infrastructure and applications through Git. ArgoCD and Flux are popular GitOps tools.
Why GitOps?
GitOps provides:
- Version control for all changes
- Audit trail of who changed what
- Rollback capabilities
- Consistency across environments
I've seen teams manage Kubernetes manifests manually, only to have configuration drift and deployment issues. GitOps solves these problems.
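In ArgoCD, for example, an Application resource ties a Git path to a cluster namespace (the repo URL and path are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/deploy-repo.git
    targetRevision: main
    path: k8s/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true    # delete resources removed from Git
      selfHeal: true # revert manual drift back to the Git state
```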
Image Security: Scan Everything
Container images can contain vulnerabilities. Scan them before deploying:
```yaml
# In CI/CD pipeline (GitHub Actions step syntax)
- name: Scan image
  run: |
    # --exit-code 1 makes trivy fail the step when matching vulnerabilities are found;
    # without it, trivy exits 0 even when vulnerabilities exist
    trivy image --exit-code 1 --severity HIGH,CRITICAL "$IMAGE:$TAG"
```
Base Image Selection
Choose base images carefully. I prefer:
- Official images from Docker Hub or container registries
- Minimal images (Alpine, Distroless) when possible
- Regularly updated images
I've seen teams use outdated base images with known vulnerabilities. Keep base images updated.
Backup and Disaster Recovery: Hope for the Best, Plan for the Worst
Disasters happen. Be prepared.
etcd Backups: Critical
etcd stores all Kubernetes cluster state. Back it up regularly:
```shell
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```
Test Your Backups
I've seen teams with backup procedures that don't work when tested. Test your backup and restore procedures regularly. A backup that can't be restored is worse than no backup.
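At minimum, verify each snapshot's integrity and periodically rehearse a restore into a scratch directory (file name illustrative; newer etcd versions prefer etcdutl for restore):

```shell
# Verify snapshot integrity and metadata
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250320.db -w table

# Rehearse a restore (writes to a new data dir, doesn't touch the live cluster)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20250320.db \
  --data-dir /tmp/etcd-restore-test
```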
Application Data Backups
Backup application data separately from etcd. Use:
- Volume snapshots for quick recovery
- Application-level backups for point-in-time recovery
- Cross-region replication for disaster recovery
Performance Optimization: Getting the Most from Your Cluster
Kubernetes clusters are expensive. Optimize them.
Node Sizing: Right-Size Your Nodes
Node sizing affects both cost and performance. I've seen teams use oversized nodes "just to be safe," wasting money, or undersized nodes, causing performance issues.
Monitor node utilization:
- CPU utilization should be 60-80% on average
- Memory utilization should be 70-85% on average
- Leave headroom for spikes and system overhead
Cluster Autoscaling: Dynamic Capacity
Cluster Autoscaler automatically adds and removes nodes based on pod scheduling needs. It's essential for cost optimization:
```yaml
# Fragment of the cluster-autoscaler Deployment spec. Cluster Autoscaler is
# configured via flags on its own deployment, not via your workloads' manifests.
# AWS provider shown; the node group name is illustrative.
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --nodes=3:10:my-node-group # min:max:node-group-name
```
Cluster Autoscaler Gotchas
Cluster Autoscaler has limitations:
- It only scales based on unschedulable pods
- It has a scale-down delay (default 10 minutes)
- It can't scale below minimum node count
- It requires node groups with autoscaling enabled
Pod Disruption Budgets: Protecting Critical Workloads
Pod Disruption Budgets (PDBs) prevent too many pods from being evicted simultaneously:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2 # Or use maxUnavailable
  selector:
    matchLabels:
      app: my-app
```
PDBs are essential for maintaining availability during:
- Node maintenance
- Cluster upgrades
- Voluntary pod evictions
Conclusion
Running Kubernetes in production is challenging but rewarding. The key is to start with the fundamentals—security, resource management, and observability—then gradually adopt more advanced features.
Remember: Kubernetes is a tool, not a solution. It requires understanding, discipline, and continuous improvement. But when used correctly, it provides reliability, scalability, and operational efficiency that's hard to achieve with other platforms.
The most important lesson I've learned? Don't try to learn everything at once. Master the fundamentals, deploy to production, learn from your mistakes, and iterate. Kubernetes is a journey, not a destination.