Monitoring and Observability in Cloud-Native Systems
I remember the first time I had to debug a production incident without proper observability. It was 2 AM, users were complaining, and I had no idea what was happening. I was looking at logs scattered across multiple servers, trying to piece together what went wrong. It took four hours to find the root cause—a database connection pool that had exhausted itself.
That experience taught me that observability isn't a nice-to-have—it's essential. In cloud-native systems, where services are distributed, ephemeral, and constantly changing, you can't operate effectively without comprehensive observability.
This guide shares what I've learned building observability systems for applications serving millions of requests daily. These aren't theoretical concepts—they're practices I've used in production and refined through countless incidents.
Observability vs. Monitoring: Understanding the Difference
Let me start by clarifying a common confusion: monitoring and observability are related but different.
Monitoring is what you build when you know what to look for. You set up dashboards for known metrics, create alerts for known failure modes. It's reactive—you're watching for problems you've seen before.
Observability is what you need when you don't know what to look for. It's the ability to understand your system's internal state by examining its outputs. It's proactive—you can explore and discover issues you didn't anticipate.
In practice, you need both. Monitoring catches known problems quickly. Observability helps you understand unknown problems.
The Three Pillars: Metrics, Logs, and Traces
The "three pillars of observability" is a useful mental model, but it's also a simplification. In reality, these three work together, and you need all of them to understand your system.
Metrics: Quantitative System Behavior
Metrics are numerical measurements over time. They're efficient to store and query, making them ideal for dashboards and alerting.
Metric Types: Understanding the Basics
Prometheus defines four metric types, and understanding them is crucial:
Counters: Incremental values that only go up. Use them for things like total requests, total errors, total bytes processed.
from prometheus_client import Counter
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'status'])
# Increment counter
request_count.labels(method='GET', status='200').inc()
Counters are cumulative. To get a rate, you use functions like rate() or irate() in PromQL.
Gauges: Values that can go up or down. Use them for current state: CPU usage, memory usage, queue depth, active connections.
from prometheus_client import Gauge
active_connections = Gauge('active_connections', 'Number of active connections')
# Set gauge value
active_connections.set(42)
Gauges represent a point-in-time value. You can query them directly without rate functions.
Histograms: Distributions of values. Use them for things like request latency, response sizes, processing times.
from prometheus_client import Histogram
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    buckets=[0.1, 0.5, 1.0, 2.5, 5.0, 10.0]
)
# Record duration
with request_duration.time():
    process_request()
Histograms create multiple time series: one for the sum, one for the count, and one for each bucket. This allows you to calculate percentiles.
Summaries: Similar to histograms but calculated on the client side. They provide quantiles directly.
from prometheus_client import Summary
request_duration = Summary(
    'http_request_duration_seconds',
    'HTTP request duration'
)
# Record duration
request_duration.observe(0.5)
Summaries are less flexible than histograms: because quantiles are computed on the client, they can't be aggregated across instances, but they avoid creating a time series per bucket on the server.
Choosing the Right Metric Type
I've seen teams use the wrong metric type, making queries difficult or impossible. Here's my rule of thumb:
- Counters: For cumulative values (requests, errors, bytes)
- Gauges: For current state (CPU, memory, queue depth)
- Histograms: For distributions when you need flexibility
- Summaries: For distributions when you know the quantiles you need
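To make that concrete, here's a minimal sketch of instrumenting one request handler with the three most common types. The metric names and the handle_request wrapper are illustrative, not from any particular service:

from prometheus_client import Counter, Gauge, Histogram

REQUESTS = Counter('app_requests_total', 'Total requests', ['method', 'status'])
IN_FLIGHT = Gauge('app_requests_in_flight', 'Requests currently being processed')
LATENCY = Histogram('app_request_duration_seconds', 'Request duration',
                    buckets=[0.1, 0.5, 1.0, 2.5, 5.0])

def handle_request(method, do_work):
    IN_FLIGHT.inc()                      # gauge: current state, goes up and down
    try:
        with LATENCY.time():             # histogram: observe the duration
            result = do_work()
        REQUESTS.labels(method=method, status='200').inc()  # counter: only goes up
        return result
    except Exception:
        REQUESTS.labels(method=method, status='500').inc()
        raise
    finally:
        IN_FLIGHT.dec()

The three answer different questions about the same traffic: how much (counter), how busy right now (gauge), and how slow (histogram).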
Logs: Discrete Events with Context
Logs capture discrete events with full context. They're verbose but essential for debugging.
Structured Logging: Making Logs Useful
Unstructured logs are hard to search and analyze. Structured logs are machine-readable and queryable:
import structlog

# Configure structlog to emit JSON so the output matches the example below
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ]
)

logger = structlog.get_logger()
logger.info(
    "request_completed",
    method="GET",
    path="/api/users",
    status_code=200,
    duration_ms=45,
    user_id="user-123",
    request_id="req-abc123"
)
This produces JSON that's easy to parse and query:
{
  "timestamp": "2025-09-22T10:15:30Z",
  "level": "INFO",
  "event": "request_completed",
  "method": "GET",
  "path": "/api/users",
  "status_code": 200,
  "duration_ms": 45,
  "user_id": "user-123",
  "request_id": "req-abc123"
}
Log Levels: Use Them Correctly
Log levels indicate severity, but I've seen teams misuse them:
- DEBUG: Detailed diagnostic information. Only enable in development or when debugging specific issues.
- INFO: General informational messages about normal operation. Use for important events.
- WARN: Warning messages about potentially harmful situations. The system continues to operate.
- ERROR: Error events that might still allow the system to continue. Use for exceptions that are handled.
- FATAL: Critical errors that cause the system to abort. Rarely used in practice.
I've seen teams log everything at INFO level, making it impossible to find important messages. Use log levels appropriately.
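A practical corollary: treat the log level as configuration, not code. Here's a minimal sketch using structlog's level filtering; the LOG_LEVEL environment variable is my assumption, not a standard:

import logging
import os
import structlog

# Read the threshold from the environment so production can run at INFO
# while local debugging runs at DEBUG, without code changes.
level = getattr(logging, os.environ.get("LOG_LEVEL", "INFO").upper(), logging.INFO)
structlog.configure(
    wrapper_class=structlog.make_filtering_bound_logger(level),
)

logger = structlog.get_logger()
logger.debug("cache_lookup", key="user-123")   # dropped when the level is INFO
logger.info("request_completed")               # normal operation
logger.warning("retrying_downstream_call")     # degraded but still functional
logger.error("order_persist_failed")           # handled exception path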
Centralized Logging: Aggregating from Many Sources
In cloud-native systems, logs come from many sources: containers, services, infrastructure. You need to aggregate them:
Fluentd vs. Fluent Bit
Both collect and forward logs. Fluent Bit is lighter and faster, designed for containerized environments. Fluentd has more plugins but higher resource usage.
I use Fluent Bit for Kubernetes clusters:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5

    [OUTPUT]
        Name   loki
        Match  kube.*
        Url    http://loki:3100/api/prom/push
Log Storage: Choosing the Right Solution
Different log storage solutions have different trade-offs:
- Elasticsearch: Powerful but resource-intensive. Good for complex queries.
- Loki: Prometheus-style log aggregation. Efficient and integrates with Grafana.
- CloudWatch Logs: AWS-native. Easy to set up but can be expensive at scale.
- S3 + Athena: Cost-effective for long-term storage and analysis.
I use Loki for recent logs (last 30 days) and S3 + Athena for long-term storage and compliance.
Traces: Understanding Request Flow
Distributed tracing shows how requests flow through your system. It's essential for microservices architectures.
OpenTelemetry: The Standard
OpenTelemetry is becoming the standard for observability. It provides a single API for metrics, logs, and traces:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Set up tracing and export spans via OTLP
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Create span
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order_id", order_id)
    span.set_attribute("amount", amount)
    # Process order
    process_order(order_id)
    span.set_status(trace.Status(trace.StatusCode.OK))
Trace Context Propagation: Following Requests
For tracing to work, trace context must propagate through services:
import requests

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

# In service A (sending request)
carrier = {}
inject(carrier)  # writes traceparent/tracestate headers into the carrier
# Send request with the propagated headers
response = requests.post(url, headers=carrier)

# In service B (receiving request)
# (service B configures its own TracerProvider and exporter at startup, as shown earlier)
carrier = dict(request.headers)
context = extract(carrier)
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle_request", context=context):
    # Handle request
    pass
Sampling: The Performance Trade-off
Tracing every request can be expensive. Use sampling:
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Sample 10% of traces
sampler = TraceIdRatioBased(0.1)
trace.set_tracer_provider(TracerProvider(sampler=sampler))
I use:
- 100% sampling for errors
- 1-10% sampling for successful requests
- Adaptive sampling based on load
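For the ratio piece of that policy, a common head-sampling setup combines ParentBased with TraceIdRatioBased, so a service honors its caller's sampling decision and only ratio-samples traces it starts itself. Error-aware and adaptive sampling typically happen downstream (for example, tail sampling in a collector); this sketch covers only the ratio part:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Respect the caller's sampling decision; sample 5% of traces that start here
sampler = ParentBased(root=TraceIdRatioBased(0.05))
trace.set_tracer_provider(TracerProvider(sampler=sampler))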
Prometheus: The Metrics Backbone
Prometheus is the de facto standard for metrics in cloud-native systems. Understanding it is essential.
Prometheus Architecture
Prometheus works by scraping metrics from targets:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
Service Discovery
Prometheus discovers targets through service discovery. For Kubernetes, this means it automatically finds pods to scrape based on annotations.
The Pull Model
Prometheus uses a pull model—it scrapes metrics from targets. This is different from push-based systems where targets send metrics.
The pull model has advantages:
- Targets don't need to know about Prometheus
- Prometheus controls scrape frequency
- Easier to add/remove targets
But it also has limitations:
- Targets must be reachable from Prometheus
- Short-lived targets might be missed
- No authentication by default
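On the target side, participating in the pull model just means exposing an HTTP endpoint Prometheus can scrape. With the Python client that's a single call; the port and metric here are arbitrary examples:

import time
from prometheus_client import Counter, start_http_server

heartbeats = Counter('app_heartbeats_total', 'Heartbeat ticks')

if __name__ == '__main__':
    # Serve /metrics on port 8000; Prometheus pulls from it on its own schedule
    start_http_server(8000)
    while True:
        heartbeats.inc()
        time.sleep(5)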
PromQL: Querying Metrics
PromQL is Prometheus's query language. It's powerful but has a learning curve:
Basic Queries
# Get current value
http_requests_total
# Get rate over 5 minutes
rate(http_requests_total[5m])
# Filter by labels
http_requests_total{method="GET", status="200"}
# Aggregate
sum(http_requests_total) by (method)
Common Patterns
I use these patterns frequently:
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# CPU utilization
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Recording Rules: Pre-computing Expensive Queries
Recording rules pre-compute expensive queries:
groups:
  - name: app_rules
    interval: 30s
    rules:
      - record: app:http_requests:rate5m
        expr: rate(http_requests_total[5m])
      - record: app:http_errors:rate5m
        expr: rate(http_requests_total{status=~"5.."}[5m])
Use recording rules for:
- Expensive aggregations
- Frequently used queries
- Cross-metric calculations
The Cardinality Problem
High cardinality is Prometheus's biggest challenge. Each unique combination of label values creates a new time series:
# BAD: High cardinality
request_count.labels(user_id=user_id).inc() # Creates series per user
# GOOD: Low cardinality
request_count.labels(method=method, status=status).inc() # Limited combinations
I've seen Prometheus instances crash due to high cardinality. Avoid labels with high cardinality (user IDs, request IDs, etc.).
Alerting: Knowing When Things Go Wrong
Alerting is how you know when things go wrong. But alerting is hard—too many alerts and you ignore them all, too few and you miss problems.
Alert Rules: Defining What to Alert On
Alert rules define when to send alerts:
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[5m]) /
            rate(http_requests_total[5m])
          ) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s"
Alert Best Practices
- Alert on symptoms, not causes: Alert on high error rate, not on individual errors
- Use meaningful thresholds: Base thresholds on SLOs, not arbitrary values
- Add context: Include enough information to understand and fix the issue
- Avoid alert fatigue: Don't alert on everything
Alert Severity
I use three severity levels:
- Critical: Immediate action required. System is down or severely degraded.
- Warning: Attention needed soon. System is degraded but functional.
- Info: Informational. No immediate action needed.
Alert Routing: Getting Alerts to the Right People
Route alerts based on severity and team:
route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'oncall'
      continue: true
    - match:
        team: 'backend'
      receiver: 'backend-team'
Notification Channels
Use multiple channels:
- PagerDuty: For critical alerts requiring immediate response
- Slack: For team notifications and non-critical alerts
- Email: For summary reports and low-priority alerts
Reducing Alert Fatigue
Alert fatigue is real. I've seen teams with hundreds of alerts, most of which are ignored. Here's how to reduce it:
- Tune thresholds carefully: Base them on actual behavior, not guesses
- Use alert grouping: Group related alerts together
- Implement alert suppression: Suppress alerts during maintenance windows
- Regular alert review: Review and remove unnecessary alerts quarterly
During these reviews, if an alert hasn't fired in three months, I question whether it's still needed.
Dashboards: Visualizing Your System
Dashboards are how you understand your system at a glance. But good dashboards are hard to create.
Grafana: The Visualization Tool
Grafana is the standard for visualizing Prometheus metrics. It's powerful but requires thought to use effectively.
Dashboard Design Principles
- Focus on key metrics: Don't try to show everything
- Use appropriate visualizations: Line graphs for trends, gauges for current state
- Group related metrics: Organize by service or function
- Keep it simple: Complex dashboards are hard to understand
Common Dashboard Panels
I include these panels in every dashboard:
- Request rate: rate(http_requests_total[5m])
- Error rate: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
- Latency percentiles: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
- Resource utilization: CPU, memory, disk, network
Dashboard Variables
Use variables to make dashboards reusable:
# Variable: service
# Query: label_values(http_requests_total, service)
# Use in query
rate(http_requests_total{service="$service"}[5m])
This allows one dashboard to work for all services.
SLO and SLI: Defining Reliability
Service Level Objectives (SLOs) define reliability targets. Service Level Indicators (SLIs) measure reliability.
Defining SLIs
SLIs measure reliability from the user's perspective:
- Availability: Uptime percentage
- Latency: Response time percentiles
- Throughput: Requests per second
- Error rate: Error percentage
slis:
  availability:
    query: |
      (
        sum(rate(http_requests_total{status!~"5.."}[5m])) /
        sum(rate(http_requests_total[5m]))
      )
    target: 0.999  # 99.9%
  latency:
    query: |
      histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
    target: 0.5  # 500ms
Setting SLOs
SLOs should be:
- Achievable: Based on current performance
- Measurable: Can be tracked with metrics
- User-focused: Based on user experience
- Documented: Clear and communicated
I set SLOs based on:
- Current performance (baseline)
- Business requirements
- User expectations
- Cost of achieving higher SLOs
Error Budgets
Error budgets are how much unreliability you can tolerate:
error_budget:
  window: 30d
  target: 99.9%    # 0.1% error budget
  actual: 99.95%   # 0.05% used
  remaining: 0.05% # 0.05% remaining
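To make the arithmetic explicit, here's a small sketch that turns the SLO target and a measured SLI into consumed budget, remaining budget, and allowed downtime; the numbers mirror the example above:

WINDOW_DAYS = 30
slo_target = 0.999      # 99.9% availability objective
measured_sli = 0.9995   # 99.95% measured over the window

budget = 1 - slo_target        # 0.1% of the window may be "bad"
consumed = 1 - measured_sli    # 0.05% actually was
remaining = budget - consumed  # 0.05% of budget left

allowed_downtime_min = WINDOW_DAYS * 24 * 60 * budget  # ~43.2 minutes per 30 days
print(f"budget used: {consumed / budget:.0%}, remaining: {remaining:.4%}")
print(f"allowed downtime this window: {allowed_downtime_min:.1f} minutes")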
Use error budgets to make decisions:
- High error budget: Can take risks, deploy frequently
- Low error budget: Be conservative, focus on stability
Incident Response: When Things Go Wrong
Observability is most valuable during incidents. Here's how to use it effectively.
Runbooks: Operational Procedures
Runbooks document common issues and solutions:
# High Error Rate
## Symptoms
- Error rate > 5%
- Increased latency
- User complaints
## Investigation
1. Check error logs: `kubectl logs -f deployment/api`
2. Check metrics: `rate(http_requests_total{status=~"5.."}[5m])`
3. Check traces: Filter by error status
## Common Causes
- Database connection pool exhausted
- Downstream service failure
- Resource exhaustion
## Resolution
1. Scale up if resource exhaustion
2. Check downstream services
3. Restart if needed
Maintain runbooks for common issues. They save time during incidents.
Post-Incident Reviews
After incidents, conduct post-incident reviews:
- Timeline: What happened and when
- Root cause: Why it happened
- Impact: Who was affected and how
- Resolution: How it was fixed
- Action items: How to prevent it
I've learned more from post-incident reviews than from any other source. They're essential for improving reliability.
Conclusion
Observability is essential for operating cloud-native systems effectively. But it's not just about tools—it's about understanding your system and using that understanding to improve it.
Start with the basics:
- Metrics for quantitative understanding
- Logs for detailed context
- Traces for request flow
Then build on that foundation:
- Comprehensive alerting
- Useful dashboards
- SLOs and error budgets
Remember: observability is a journey, not a destination. Keep learning, keep improving, and keep iterating. Your observability will get better over time.
The most important lesson I've learned? You can't improve what you can't measure. Invest in observability early—it pays dividends when things go wrong.