Kubernetes Production Best Practices
I'll be honest: my first production Kubernetes deployment was a disaster. I thought I understood Kubernetes—I'd run through the tutorials, deployed a few test applications, read the documentation. How hard could it be?
Three days later, I was on a call at 2 AM trying to figure out why our application was randomly crashing, why pods were being evicted, and why our cluster was running out of resources. That experience taught me that running Kubernetes in production is fundamentally different from running it in development.
This guide is what I wish I'd known before that first production deployment. It's based on hard-won experience, late-night debugging sessions, and lessons learned from running Kubernetes clusters serving millions of requests daily.
The Kubernetes Reality Check
Kubernetes is powerful, but it's also complex. The abstraction it provides is valuable, but it comes with a cost: you need to understand how it works under the hood. When things go wrong in production, "it's Kubernetes's fault" isn't a valid excuse—you need to know how to debug and fix issues.
The good news? Once you understand Kubernetes and follow best practices, it's incredibly reliable. The bad news? Getting there requires learning a lot of concepts that aren't immediately obvious.
Security Hardening: Not Optional
Security in Kubernetes is multi-layered, and each layer matters. I've seen teams focus on one aspect (like network policies) while ignoring others (like RBAC), only to discover security vulnerabilities later.
Pod Security Standards: Start Here
Pod Security Standards (PSS) are Kubernetes's way of enforcing security policies at the namespace level. They're relatively new (the built-in PodSecurity admission controller went beta in Kubernetes 1.23 and stable in 1.25), but they're already essential for production clusters.
There are three policy levels: privileged, baseline, and restricted. For production workloads, you should use restricted:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
What PSS Actually Does
The restricted policy enforces:
- Containers must not run as root (runAsNonRoot: true)
- Containers must drop all capabilities (only NET_BIND_SERVICE may be added back)
- Pods must not use host namespaces (hostNetwork, hostPID, hostIPC)
- Volumes must not use host paths
- Containers must not allow privilege escalation and must set a seccomp profile
I've seen teams try to deploy applications that violate these policies, then wonder why they're being rejected. The solution isn't to lower the policy level—it's to fix your application to comply with the policy.
The Migration Path
If you're adding PSS to an existing cluster, don't start with enforce: restricted. Start with warn: restricted to see what would be blocked, fix your applications, then move to audit: restricted, and finally enforce: restricted. This gradual approach prevents breaking existing workloads.
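A namespace can mix modes during this migration, for example enforcing the weaker baseline policy while warning and auditing against restricted:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Enforce the weaker policy for now; surface restricted violations without blocking
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```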
Network Policies: Defense in Depth
Network policies are Kubernetes's built-in firewall. They control traffic between pods, and they're essential for multi-tenant clusters or applications with strict security requirements.
The Default Problem
By default, Kubernetes allows all traffic between pods. This is convenient for development but dangerous for production. I've seen teams deploy applications that were accessible from any pod in the cluster, only to discover this during a security audit.
Start with a default-deny policy:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {} # Selects all pods
  policyTypes:
  - Ingress
  - Egress
```
Then explicitly allow only the traffic you need:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-database
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
```
Network Policy Gotchas
Network policies are powerful but have limitations:
- They're enforced by the CNI plugin, so behavior varies between plugins
- They don't apply to pods using host networking
- Blocking traffic to the Kubernetes API server is unreliable and varies by plugin
- They're namespace-scoped, so cross-namespace policies require careful planning
I've seen teams write complex network policies only to discover their CNI plugin doesn't support all the features they're using. Test your network policies thoroughly with your specific CNI plugin.
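One gotcha deserves a concrete example: a default-deny egress policy also blocks DNS, so almost everything breaks until you explicitly allow it. A sketch of an allow-DNS policy (the kube-dns labels may differ on your cluster):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {} # all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```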
RBAC: Least Privilege in Practice
Role-Based Access Control (RBAC) is how you control who can do what in your cluster. It's essential for multi-user clusters, but it's also complex.
ServiceAccounts: Not Optional
Every pod runs with a ServiceAccount; if you don't specify one, it gets the namespace's default, and any permissions granted to that default apply to every pod in the namespace. Create dedicated ServiceAccounts with minimal permissions:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-service-account
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-role
  namespace: production
rules:
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-role-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: app-service-account
  namespace: production
roleRef:
  kind: Role
  name: app-role
  apiGroup: rbac.authorization.k8s.io
```
The Cluster-Admin Trap
I've seen teams give developers cluster-admin access "just to get things working." This is a security nightmare. Use namespaced roles whenever possible, and only grant cluster-wide permissions when absolutely necessary.
Regular Audits
RBAC configurations can become messy over time. I audit RBAC quarterly, looking for:
- Roles with excessive permissions
- ServiceAccounts not being used
- Users with unnecessary access
- Missing role bindings
Use tools like kubectl-who-can or rbac-lookup to understand who has access to what.
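Even without extra tooling, kubectl can answer most audit questions directly (the namespace and ServiceAccount names here are illustrative):

```shell
# What can this ServiceAccount do?
kubectl auth can-i --list \
  --as=system:serviceaccount:production:app-service-account

# Can it read secrets in the namespace?
kubectl auth can-i get secrets -n production \
  --as=system:serviceaccount:production:app-service-account
```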
Resource Management: The Foundation of Stability
Resource management is where many teams struggle. Kubernetes's resource model is powerful but requires understanding.
Resource Requests and Limits: Get Them Right
Resource requests and limits are how you tell Kubernetes what your application needs. Get them wrong, and you'll have problems:
Requests: What You Need
Requests are what Kubernetes uses for scheduling. A pod with requests.cpu: 500m needs half a CPU core available on a node before it can be scheduled. If no node has that capacity, the pod stays pending.
I've seen teams set requests too high, causing scheduling problems, or too low, causing performance issues. The key is to measure your application's actual resource usage and set requests based on that.
```yaml
resources:
  requests:
    memory: "256Mi" # Based on actual usage
    cpu: "250m"     # Based on actual usage
  limits:
    memory: "512Mi" # 2x requests is a good starting point
    cpu: "500m"     # 2x requests is a good starting point
```
Limits: What You Can Use
Limits are the maximum resources a pod can use. If a pod exceeds its memory limit, it gets killed (OOMKilled). If it exceeds its CPU limit, it gets throttled.
I've seen teams set limits equal to requests, which prevents pods from using additional resources when available. This is overly conservative. Set limits higher than requests to allow pods to burst when resources are available.
The Quality of Service Classes
Kubernetes assigns Quality of Service (QoS) classes based on requests and limits:
- Guaranteed: Both requests and limits are set and equal. These pods are last to be evicted.
- Burstable: Requests are set but limits are higher, or only requests are set. These pods can be evicted if the node runs out of resources.
- BestEffort: No requests or limits. These pods are first to be evicted.
For production workloads, use Guaranteed QoS for critical, latency-sensitive services. It provides the best protection against eviction, though it means setting limits equal to requests and giving up the burst headroom described above.
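A Guaranteed-class container simply sets limits equal to requests:

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi" # equal to requests -> Guaranteed QoS
    cpu: "500m"
```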
Horizontal Pod Autoscaling: It's Not Magic
Horizontal Pod Autoscaling (HPA) automatically scales the number of pods based on metrics. It's powerful, but it requires tuning:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0 # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
```
HPA Gotchas
I've seen teams set up HPA and then wonder why it's not working:
- HPA needs metrics. If metrics aren't available, HPA can't make decisions.
- HPA has a default sync period of 15 seconds. Changes don't happen instantly.
- HPA scales based on average metrics across all pods. If one pod is overloaded but others aren't, HPA might not scale.
- HPA needs headroom to scale. If all nodes are at capacity, HPA can't add more pods.
Custom Metrics
CPU and memory are good starting points, but custom metrics are often better. I've used custom metrics for:
- Request rate (requests per second)
- Queue depth (for message processing)
- Business metrics (orders per minute)
Custom metrics require a metrics adapter (like Prometheus Adapter) and more setup, but they provide better autoscaling decisions.
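With an adapter installed, the metrics section of an HPA spec can target a custom per-pod metric; a sketch, assuming you've exposed and registered a metric named http_requests_per_second:

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100" # scale so each pod handles ~100 req/s on average
```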
Vertical Pod Autoscaling: The Overlooked Feature
Vertical Pod Autoscaling (VPA) automatically adjusts resource requests and limits based on actual usage. It's less commonly used than HPA, but it's valuable for right-sizing workloads.
VPA can run in three modes:
- Off: Only provides recommendations
- Initial: Sets resources when pods are created
- Auto: Updates resources for running pods (requires pod recreation)
I use VPA in Off mode to get recommendations, then manually adjust resources. Auto mode is powerful but can cause pod churn as pods are recreated with new resource settings.
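A recommendation-only VPA looks like this (the VPA CRDs and controllers must be installed separately; names are illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off" # recommendations only, no automatic changes
```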
Monitoring and Observability: Know What's Happening
You can't operate Kubernetes effectively without good observability. I've seen teams deploy applications without monitoring, only to discover problems when users complain.
Prometheus: The Metrics Backbone
Prometheus is the de facto standard for Kubernetes metrics. It's powerful but requires understanding:
Scrape Configuration
Prometheus discovers targets to scrape using service discovery. For Kubernetes, this means configuring scrape configs:
```yaml
scrape_configs:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
```
The Cardinality Problem
I've seen Prometheus instances crash due to high cardinality metrics. This happens when you create unique metric labels for things like user IDs or request IDs. Each unique combination of label values creates a new time series, and Prometheus has limits.
Avoid high-cardinality labels. Use them for things like environment, service, or region, not for things like user IDs or request IDs.
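If an application already exposes a high-cardinality label you can't remove at the source, you can drop it at scrape time with metric_relabel_configs (the label name here is illustrative):

```yaml
metric_relabel_configs:
- action: labeldrop
  regex: user_id # drop the offending label from every scraped series
```

Note that dropping a label can merge previously distinct series into one, so this is a stopgap; the real fix is in the application's instrumentation.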
Recording Rules
Recording rules pre-compute expensive queries. Use them for:
- Aggregating metrics across pods
- Computing rates or increases
- Creating summary metrics
```yaml
groups:
- name: app_rules
  interval: 30s
  rules:
  - record: app:http_requests:rate5m
    expr: rate(http_requests_total[5m])
```
Logging: Centralized and Structured
Logging in Kubernetes is different from traditional applications. Pods are ephemeral, so logs need to be collected and stored externally.
Fluentd vs. Fluent Bit
Fluentd and Fluent Bit are both log collectors. Fluent Bit is lighter and faster, while Fluentd has more plugins. For Kubernetes, I prefer Fluent Bit—it's designed for containerized environments and has lower resource usage.
Structured Logging
Structured logs are essential for production. They're easier to parse, search, and analyze:
```json
{
  "timestamp": "2025-03-20T10:15:30Z",
  "level": "INFO",
  "service": "api",
  "pod": "api-7d4f8b9c6-abc123",
  "request_id": "req-12345",
  "method": "GET",
  "path": "/api/users",
  "status_code": 200,
  "duration_ms": 45,
  "user_id": "user-789"
}
```
Log Retention
Logs are expensive to store. I've seen teams store all logs forever, only to realize they're spending more on log storage than on compute. Set retention policies:
- Application logs: 7-30 days
- Audit logs: 90 days (for compliance)
- Security logs: 1 year (for compliance)
Distributed Tracing: Understanding Request Flow
Distributed tracing shows how requests flow through your system. It's essential for microservices architectures.
OpenTelemetry: The Standard
OpenTelemetry is becoming the standard for observability. It provides a single API for metrics, logs, and traces. I'm migrating all my applications to OpenTelemetry.
Sampling: The Performance Trade-off
Tracing every request can be expensive. Use sampling to reduce overhead:
- 100% sampling for errors
- 1-10% sampling for successful requests
- Adaptive sampling based on load
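In an OpenTelemetry Collector, this policy mix maps onto the tail_sampling processor; a minimal sketch (policy names are illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
    - name: keep-errors
      type: status_code
      status_code:
        status_codes: [ERROR] # keep 100% of errored traces
    - name: sample-rest
      type: probabilistic
      probabilistic:
        sampling_percentage: 5 # ~5% of everything else
```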
High Availability: Surviving Failures
High availability isn't just about running multiple replicas—it's about designing your system to survive failures gracefully.
Multi-Zone Deployment: Essential for Production
Deploying across multiple availability zones is essential for production. A single-zone deployment will fail when that zone has issues.
Pod Anti-Affinity
Use pod anti-affinity to ensure pods are distributed across zones:
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - my-app
        topologyKey: topology.kubernetes.io/zone
```
Zone-Aware Load Balancing
Ensure your load balancer distributes traffic across zones. I've seen load balancers that send all traffic to one zone, defeating the purpose of multi-zone deployment.
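Topology spread constraints are a newer, often simpler alternative to anti-affinity for spreading pods across zones:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway # use DoNotSchedule for a hard guarantee
  labelSelector:
    matchLabels:
      app: my-app
```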
StatefulSets: Handling State
StatefulSets are for applications that need stable network identities and persistent storage. They're more complex than Deployments but necessary for stateful workloads.
Headless Services
StatefulSets use headless services (clusterIP: None) to provide stable network identities. Each pod gets a predictable hostname: <statefulset-name>-<ordinal>.<service-name>.<namespace>.svc.cluster.local.
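A headless service is an ordinary Service with clusterIP set to None (names and port are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-statefulset-svc
  namespace: production
spec:
  clusterIP: None # headless: DNS resolves to individual pod addresses
  selector:
    app: my-app
  ports:
  - port: 5432
```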
Persistent Volumes
StatefulSets need persistent volumes. Use StatefulSet volume claims:
```yaml
volumeClaimTemplates:
- metadata:
    name: data
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: "fast-ssd"
    resources:
      requests:
        storage: 100Gi
```
Backup Strategies
Stateful applications need backups. I use:
- Volume snapshots for quick recovery
- Application-level backups for point-in-time recovery
- Cross-region replication for disaster recovery
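Volume snapshots are requested declaratively; this requires a CSI driver with snapshot support, and the class and PVC names below are illustrative:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snapshot-20250320
  namespace: production
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: data-my-app-0 # PVC created by the volumeClaimTemplate
```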
CI/CD Integration: Automate Everything
Kubernetes and CI/CD go hand in hand. Manual deployments don't scale.
GitOps: Infrastructure as Code
GitOps is the practice of managing infrastructure and applications through Git. ArgoCD and Flux are popular GitOps tools.
Why GitOps?
GitOps provides:
- Version control for all changes
- Audit trail of who changed what
- Rollback capabilities
- Consistency across environments
I've seen teams manage Kubernetes manifests manually, only to have configuration drift and deployment issues. GitOps solves these problems.
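In ArgoCD, for example, an Application resource ties a Git path to a cluster namespace (the repo URL and path are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/deploy-repo.git
    targetRevision: main
    path: k8s/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true    # delete resources removed from Git
      selfHeal: true # revert manual drift back to the Git state
```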
Image Security: Scan Everything
Container images can contain vulnerabilities. Scan them before deploying:
```yaml
# In CI/CD pipeline (GitHub Actions step syntax)
- name: Scan image
  run: |
    # --exit-code 1 makes trivy fail the step when matching vulnerabilities are found;
    # without it, trivy exits 0 even when vulnerabilities exist
    trivy image --exit-code 1 --severity HIGH,CRITICAL "$IMAGE:$TAG"
```
Base Image Selection
Choose base images carefully. I prefer:
- Official images from Docker Hub or container registries
- Minimal images (Alpine, Distroless) when possible
- Regularly updated images
I've seen teams use outdated base images with known vulnerabilities. Keep base images updated.
Backup and Disaster Recovery: Hope for the Best, Plan for the Worst
Disasters happen. Be prepared.
etcd Backups: Critical
etcd stores all Kubernetes cluster state. Back it up regularly:
```shell
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```
Test Your Backups
I've seen teams with backup procedures that don't work when tested. Test your backup and restore procedures regularly. A backup that can't be restored is worse than no backup.
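At minimum, verify each snapshot's integrity and periodically rehearse a restore into a scratch directory (file name illustrative; newer etcd versions prefer etcdutl for restore):

```shell
# Verify snapshot integrity and metadata
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250320.db -w table

# Rehearse a restore (writes to a new data dir, doesn't touch the live cluster)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20250320.db \
  --data-dir /tmp/etcd-restore-test
```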
Application Data Backups
Backup application data separately from etcd. Use:
- Volume snapshots for quick recovery
- Application-level backups for point-in-time recovery
- Cross-region replication for disaster recovery
Performance Optimization: Getting the Most from Your Cluster
Kubernetes clusters are expensive. Optimize them.
Node Sizing: Right-Size Your Nodes
Node sizing affects both cost and performance. I've seen teams use oversized nodes "just to be safe," wasting money, or undersized nodes, causing performance issues.
Monitor node utilization:
- CPU utilization should be 60-80% on average
- Memory utilization should be 70-85% on average
- Leave headroom for spikes and system overhead
Cluster Autoscaling: Dynamic Capacity
Cluster Autoscaler automatically adds and removes nodes based on pod scheduling needs. It's essential for cost optimization:
```yaml
# Fragment of the cluster-autoscaler Deployment spec. Cluster Autoscaler is
# configured via flags on its own deployment, not via your workloads' manifests.
# AWS provider shown; the node group name is illustrative.
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --nodes=3:10:my-node-group # min:max:node-group-name
```
Cluster Autoscaler Gotchas
Cluster Autoscaler has limitations:
- It only scales based on unschedulable pods
- It has a scale-down delay (default 10 minutes)
- It can't scale below minimum node count
- It requires node groups with autoscaling enabled
Pod Disruption Budgets: Protecting Critical Workloads
Pod Disruption Budgets (PDBs) prevent too many pods from being evicted simultaneously:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2 # Or use maxUnavailable
  selector:
    matchLabels:
      app: my-app
```
PDBs are essential for maintaining availability during:
- Node maintenance
- Cluster upgrades
- Voluntary pod evictions
Conclusion
Running Kubernetes in production is challenging but rewarding. The key is to start with the fundamentals—security, resource management, and observability—then gradually adopt more advanced features.
Remember: Kubernetes is a tool, not a solution. It requires understanding, discipline, and continuous improvement. But when used correctly, it provides reliability, scalability, and operational efficiency that's hard to achieve with other platforms.
The most important lesson I've learned? Don't try to learn everything at once. Master the fundamentals, deploy to production, learn from your mistakes, and iterate. Kubernetes is a journey, not a destination.