Getting Started with AWS Architecture
After spending over a decade architecting solutions on AWS, I've learned that building robust cloud architectures isn't just about knowing which services to use—it's about understanding how they work together, when to use them, and more importantly, when not to. This guide distills lessons from countless production deployments, late-night incident responses, and architectural reviews.
The Reality of Cloud Architecture
Let me start with something that took me years to fully appreciate: AWS isn't a magic solution that makes everything easier. It's a powerful toolkit that, when used correctly, can transform how you build and operate systems. But misuse it, and you'll find yourself with a complex, expensive mess that's harder to manage than your old on-premises infrastructure.
The key difference between a good AWS architect and a great one? The great ones know when to say "no" to a service, when to keep things simple, and when complexity is actually justified.
Core AWS Services: Beyond the Basics
When I first started with AWS, I thought understanding EC2, S3, and RDS was enough. I was wrong. Here's what I've learned about each service and the gotchas that aren't in the documentation.
EC2: More Than Just Virtual Servers
EC2 seems straightforward—spin up a virtual machine, install your app, done. But here's what the documentation doesn't tell you:
Instance Types Matter More Than You Think
I once spent weeks debugging performance issues in a production application, only to discover we were using t3.micro instances for a CPU-intensive workload. The burstable performance model means you get baseline performance, but when you need more, you consume CPU credits. Run out of credits, and your application grinds to a halt.
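The burst model described above can be sketched numerically. The figures below are assumptions based on AWS's published t3.micro numbers (roughly 12 credits earned per hour, 2 vCPUs, one credit equal to one vCPU-minute at 100%); verify them against current documentation before relying on them:

```python
# Rough sketch of the t3 burstable-credit model. The constants are
# assumptions drawn from published t3.micro figures (~12 credits earned
# per hour, 2 vCPUs, 1 credit = 1 vCPU-minute at 100% utilization).

def hours_until_throttled(starting_credits, earn_per_hour, vcpus, utilization):
    """Return how many whole hours a sustained load runs before the
    credit balance hits zero (None if the load is at or below baseline)."""
    burn_per_hour = vcpus * utilization * 60  # vCPU-minutes used per hour
    net = burn_per_hour - earn_per_hour
    if net <= 0:
        return None  # at or below baseline: credits never run out
    hours = 0
    credits = starting_credits
    while credits > 0:
        credits -= net
        hours += 1
    return hours

# A t3.micro launched with a full balance (~144 credits) running a
# sustained 50% load on both vCPUs throttles in a few hours:
print(hours_until_throttled(144, 12, 2, 0.50))
```

A steady 10% load, by contrast, sits at the baseline and never exhausts credits, which is exactly why the problem only shows up under sustained pressure.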
The lesson? Always match instance types to your workload characteristics:
- General Purpose (M5, M6i): Good defaults for most applications
- Compute Optimized (C5, C6i): CPU-intensive workloads, batch processing
- Memory Optimized (R5, R6i): Databases, in-memory caches, analytics
- Storage Optimized (I3, I4i): NoSQL databases, data warehousing
Placement Groups: The Hidden Performance Feature
If you're running a distributed system where nodes need low-latency communication, placement groups are your friend. I've seen Cassandra clusters where moving from random placement to a cluster placement group reduced p99 latency by 40%. The trade-off? Less fault tolerance—if an availability zone fails, your entire cluster might be affected.
resource "aws_placement_group" "app_cluster" {
  name     = "app-cluster-pg"
  strategy = "cluster" # or "partition" or "spread"
}

resource "aws_instance" "app" {
  placement_group = aws_placement_group.app_cluster.name
  # ... other configuration
}
S3: It's Not Just File Storage
S3 is deceptively simple. Drop files in, retrieve them later. But I've seen teams burn thousands of dollars monthly because they didn't understand how S3 actually works.
Storage Classes: The Cost-Performance Trade-off
The default STANDARD storage class is expensive for data you rarely access. I once helped a client reduce their S3 costs by 70% by implementing a lifecycle policy that moved old logs to GLACIER after 30 days and DEEP_ARCHIVE after 90 days.
{
  "Rules": [
    {
      "Id": "MoveOldLogsToGlacier",
      "Status": "Enabled",
      "Prefix": "logs/",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 90,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ]
    }
  ]
}
The Request Pricing Trap
Here's something that caught me off guard: S3 charges per request, and the pricing varies dramatically by storage class. Retrieving data from GLACIER costs $0.01 per 1,000 requests, but retrieving from DEEP_ARCHIVE costs $0.05 per 1,000 requests. If you're building a system that needs frequent access to archived data, you're better off using INTELLIGENT_TIERING or keeping data in STANDARD_IA.
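A quick back-of-the-envelope calculation using the per-1,000-request figures quoted above makes the gap concrete (the request volume is hypothetical, and per-GB retrieval fees, which are billed separately, are left out):

```python
# Request-cost comparison using the per-1,000-request figures quoted
# above; verify against current S3 pricing for your region. Per-GB
# retrieval fees are billed on top of this and excluded here.

def monthly_request_cost(requests_per_month, price_per_1000):
    return requests_per_month / 1000 * price_per_1000

reqs = 2_000_000  # hypothetical: 2M retrieval requests per month
glacier = monthly_request_cost(reqs, 0.01)
deep_archive = monthly_request_cost(reqs, 0.05)
print(glacier, deep_archive)
```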
Versioning and Lifecycle: A Double-Edged Sword
I enabled versioning on an S3 bucket once without setting up lifecycle policies. Three months later, we had 50TB of storage, mostly old versions of files. The bill was... unpleasant. Always pair versioning with lifecycle policies that delete non-current versions after a reasonable period.
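A minimal sketch of such a policy, shaped for boto3's `put_bucket_lifecycle_configuration`: the rule ID, the empty prefix, and the 30-day window are illustrative choices, not values from the incident above:

```python
# Sketch of a lifecycle rule that expires non-current object versions,
# in the shape boto3's put_bucket_lifecycle_configuration expects.
# Rule ID, prefix, and retention windows are hypothetical examples.

lifecycle_config = {
    "Rules": [
        {
            "ID": "ExpireOldVersions",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to the whole bucket
            # Delete versions 30 days after they stop being current:
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            # Also clean up abandoned multipart uploads, another
            # silent source of storage cost:
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

# Applied with something like:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_config)
```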
RDS: Managed Databases Done Right
RDS removes a lot of operational burden, but it also removes flexibility. Here's what I wish I'd known earlier:
Multi-AZ: Not Just for High Availability
Everyone knows Multi-AZ provides failover capabilities, but it also provides a hidden benefit: automated backups don't impact primary instance performance because they're taken from the standby. For a database under heavy load, this alone can justify the cost.
Parameter Groups: The Overlooked Performance Tool
I spent a month optimizing a MySQL RDS instance, tweaking queries, adding indexes. Then I discovered the default parameter group had innodb_buffer_pool_size set to only 25% of available memory. Adjusting this single parameter improved query performance by 3x.
Read Replicas: Use Them Strategically
Read replicas aren't free scaling. Each replica costs the same as the primary instance. I've seen teams create read replicas "just in case" and then forget about them, burning money. Only create replicas if you actually have read-heavy workloads or need cross-region replication.
VPC: Your Network Foundation
VPC networking is where many architects stumble. The concepts are simple, but the interactions are complex.
CIDR Block Planning: Think Big
I've seen teams start with a /24 CIDR block (256 IPs) thinking it's plenty, only to run out of addresses when they need to add more subnets or scale. Start with at least a /16 (65,536 IPs) for production VPCs. You can always use smaller subnets within it.
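The arithmetic is easy to check with the standard library's `ipaddress` module. The `-5` reflects the five addresses AWS reserves in every subnet (network, router, DNS, future use, broadcast):

```python
# Subnet math with the stdlib ipaddress module: a /16 VPC carved into
# /20 subnets. AWS reserves 5 addresses per subnet, hence the -5.

import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=20))
print(len(subnets))                   # number of /20 subnets in a /16
print(subnets[0].num_addresses - 5)   # usable addresses per subnet
```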
NAT Gateway Costs: The Silent Killer
NAT Gateways cost $0.045 per GB of data processed, plus $0.045 per hour. For a system processing 10TB monthly, that's $450 just for NAT. I've seen teams use NAT Gateways for internal traffic that doesn't need internet access. Use VPC endpoints for AWS services, and only use NAT Gateways when you truly need outbound internet access.
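Using the rates quoted above, the monthly bill for that 10TB scenario works out as follows (assuming a 730-hour month and a single gateway; multi-AZ deployments typically run one gateway per AZ, multiplying the hourly charge):

```python
# Monthly NAT Gateway cost at the rates quoted above ($0.045/GB
# processed, $0.045/hour), assuming a 730-hour month. Verify rates
# for your region.

def nat_monthly_cost(gb_processed, gateways=1, hours=730,
                     per_gb=0.045, per_hour=0.045):
    return gb_processed * per_gb + gateways * hours * per_hour

# 10 TB/month through a single gateway:
print(round(nat_monthly_cost(10_000), 2))
```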
Security Groups: Stateful Firewalls
Security groups are stateful, which means if you allow outbound traffic on port 443, the return traffic is automatically allowed. This is great for security, but it can be confusing when troubleshooting. Always test connectivity from both directions.
Lambda: Serverless Done Right
Lambda is powerful, but it's not a silver bullet. Here's what I've learned:
Cold Starts: The Real Problem
Cold starts aren't just about the first request—they're about the entire execution context. I've seen teams optimize package size and use provisioned concurrency, yet still have issues because they're making external API calls during initialization. Keep initialization outside the handler so it runs once per execution environment, and defer expensive external calls until they're actually needed.
Memory Allocation: It's Not Just About Memory
Lambda allocates CPU proportionally to memory. A 512MB function gets half the CPU of a 1024MB function. I've seen teams increase memory not because they needed it, but because they needed more CPU. This works, but it's expensive. Consider using Fargate for CPU-intensive workloads.
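One consequence worth internalizing: because billing is in GB-seconds, doubling memory for a CPU-bound function roughly halves its duration, so the cost stays about flat while latency drops. The per-GB-second rate below is the widely published x86 list price; verify it for your region:

```python
# For a CPU-bound function, CPU scales with memory, so doubling memory
# roughly halves duration and the GB-second cost stays about flat.
# $0.0000166667/GB-second is the widely published x86 rate (assumption:
# verify for your region).

PRICE_PER_GB_SECOND = 0.0000166667

def invocation_cost(memory_mb, duration_s):
    return (memory_mb / 1024) * duration_s * PRICE_PER_GB_SECOND

# Same hypothetical CPU-bound job: 4 s at 512 MB vs ~2 s at 1024 MB
cost_512 = invocation_cost(512, 4.0)
cost_1024 = invocation_cost(1024, 2.0)
print(cost_512, cost_1024)  # same GB-seconds, half the latency
```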
Timeout Configuration: Set Realistic Limits
The default 3-second timeout is rarely enough. But setting it to the maximum 15 minutes can mask problems. I once debugged a Lambda that was timing out after 14 minutes because it was processing records sequentially instead of in parallel. A shorter timeout would have made the problem obvious immediately.
Design Principles That Actually Matter
High Availability: Design for Failure
The AWS Well-Architected Framework says "design for failure," but what does that actually mean in practice?
Availability Zones: Not Created Equal
In practice, availability zones differ: some have more capacity, some have better network connectivity. I've seen applications where traffic was unevenly distributed across AZs, causing one AZ to be overloaded while others were underutilized. Use cross-zone load balancing, and monitor per-AZ metrics.
Stateless Applications: Easier Said Than Done
Everyone says "make your application stateless," but session data, file uploads, and caching can make this challenging. I've helped teams move from sticky sessions (which break high availability) to distributed session stores using ElastiCache or DynamoDB. The migration wasn't trivial, but it was necessary.
Auto-Scaling: It's Not Set and Forget
Auto-scaling groups are powerful, but they require tuning. I've seen teams set up auto-scaling with default CloudWatch alarms, only to discover their application scales up too slowly during traffic spikes and scales down too aggressively during quiet periods.
The key metrics to monitor:
- CPU Utilization: Good for steady workloads, but can lag for bursty traffic
- Request Count: Better for web applications with variable request complexity
- Custom Metrics: Best when you understand your application's scaling characteristics
resource "aws_autoscaling_policy" "scale_up" {
  name                   = "scale-up-policy"
  scaling_adjustment     = 2
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 300
  autoscaling_group_name = aws_autoscaling_group.app.name
}

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 70
  alarm_actions       = [aws_autoscaling_policy.scale_up.arn]
}
Security First: It's Not Optional
Security in the cloud is different from on-premises. Here's what I've learned:
IAM: The Foundation of Security
IAM is powerful but complex. I've seen teams create IAM users with admin access "just to get things working," then forget to remove those permissions. Use IAM roles for EC2 instances, Lambda functions, and ECS tasks. Never embed access keys in code.
The Principle of Least Privilege: Actually Do It
I once audited an IAM policy that had s3:* permissions on * resources. When I asked why, the answer was "we might need it later." That's not how security works. Start with the minimum permissions needed, and add more only when you have a specific use case.
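For contrast with that `s3:*` on `*` policy, here is a sketch of what a scoped-down version might look like: read-only access to one prefix of one bucket. The bucket name and prefix are hypothetical placeholders, not values from the audit:

```python
# A least-privilege alternative to "s3:* on *": read-only access to one
# prefix of one bucket. Bucket name and prefix are hypothetical.

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadAppUploads",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-app-bucket/uploads/*",
        },
        {
            "Sid": "ListAppUploads",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            # ListBucket applies to the bucket ARN, not the objects:
            "Resource": "arn:aws:s3:::example-app-bucket",
            "Condition": {"StringLike": {"s3:prefix": ["uploads/*"]}},
        },
    ],
}
```

Note the split: `GetObject` targets object ARNs, while `ListBucket` targets the bucket ARN with a prefix condition. Mixing those up is one of the most common reasons a "least privilege" policy silently fails.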
Encryption: At Rest and In Transit
AWS makes encryption easy, but it's not enabled by default everywhere. I've seen teams forget to enable encryption on RDS instances, only to discover it during a security audit. Enable encryption from day one—it's harder to add later.
Network Security: Defense in Depth
Security groups are your first line of defense, but they're not enough. Use network ACLs for additional protection, enable VPC Flow Logs to monitor traffic, and consider AWS WAF for web applications. I've caught several security issues by analyzing VPC Flow Logs that showed unexpected traffic patterns.
Cost Optimization: It's a Continuous Process
AWS costs can spiral out of control if you're not careful. Here's what I've learned:
Right-Sizing: It's Not Just About Instance Types
Right-sizing isn't just choosing the right instance type—it's about understanding your actual resource needs. I've helped teams reduce costs by 40% by analyzing CloudWatch metrics and moving from over-provisioned instances to appropriately sized ones.
Use AWS Cost Explorer and Trusted Advisor, but also build your own dashboards. I track cost per customer, cost per request, and cost trends over time. This helps identify cost anomalies early.
Reserved Instances: When They Make Sense
Reserved instances can save up to 75% compared to on-demand, but they're a commitment. I've seen teams buy reserved instances for workloads that change frequently, only to realize they're not using them effectively.
Buy reserved instances for:
- Steady-state workloads with predictable usage
- Applications that run 24/7
- Workloads you understand well
Avoid reserved instances for:
- Development and testing environments
- Workloads with unpredictable usage patterns
- Applications you're planning to migrate or refactor
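A simple way to sanity-check an RI purchase is the break-even utilization: since an RI bills for every hour whether the instance runs or not, it only wins if the workload actually runs more than the ratio of the two rates. The prices below are illustrative placeholders, not real AWS list prices:

```python
# Break-even utilization for a reserved instance. Below this fraction
# of the month, on-demand is cheaper. Rates are illustrative
# placeholders, not real AWS prices.

def breakeven_utilization(on_demand_hourly, reserved_effective_hourly):
    """Fraction of hours you must actually run to justify the RI
    (an RI bills every hour whether or not the instance runs)."""
    return reserved_effective_hourly / on_demand_hourly

# e.g. on-demand $0.10/hr vs an RI effective rate of $0.06/hr:
print(breakeven_utilization(0.10, 0.06))  # run >60% of the time to win
```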
Spot Instances: High Risk, High Reward
Spot instances can save up to 90%, but they can be terminated with two minutes' notice. I've used spot instances successfully for:
- Batch processing jobs
- CI/CD build agents
- Development and testing environments
- Stateless web servers behind load balancers
The key is designing your application to handle interruptions gracefully.
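A minimal sketch of what "handling interruptions gracefully" looks like on EC2: poll the instance metadata `spot/instance-action` endpoint, which returns 404 until AWS schedules a termination and then a small JSON notice roughly two minutes ahead. The fetch function is injected here so the drain logic is testable off-EC2; a production version would wrap `urllib.request` and add IMDSv2 token headers:

```python
# Sketch of graceful spot-interruption handling via the instance
# metadata "spot/instance-action" endpoint (404 until AWS schedules a
# termination, then a JSON notice ~2 minutes ahead). The fetch function
# is injected so the drain logic can be exercised off-EC2.

import json

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def check_interruption(fetch):
    """fetch(url) -> response body string, or None on 404.
    Returns the scheduled action dict, or None if nothing is pending."""
    body = fetch(METADATA_URL)
    if body is None:
        return None
    return json.loads(body)

def drain_if_interrupted(fetch, drain):
    """Run in a loop (e.g. every few seconds) alongside the workload."""
    notice = check_interruption(fetch)
    if notice and notice.get("action") in ("terminate", "stop"):
        drain()  # deregister from the LB, finish in-flight work, checkpoint
        return True
    return False
```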
Architecture Patterns: Real-World Applications
Three-Tier Architecture: Still Relevant
The three-tier architecture is classic for a reason—it works. But here's how to do it right on AWS:
Presentation Tier: ALB + Auto Scaling
Use Application Load Balancer (ALB) instead of Classic Load Balancer. ALB supports path-based and host-based routing, which is essential for microservices. I've seen teams use Classic Load Balancer because it's cheaper, only to migrate later when they need advanced routing.
Application Tier: Stateless and Scalable
Your application tier should be stateless. Store session data in ElastiCache or DynamoDB, not in application memory. I've helped teams migrate from in-memory sessions to Redis, which enabled proper auto-scaling.
Data Tier: Multi-AZ and Read Replicas
Use Multi-AZ for high availability, and read replicas for read-heavy workloads. But don't create read replicas "just in case"—they cost money. Monitor your read/write ratio first.
Serverless Architecture: When It Makes Sense
Serverless isn't right for everything, but when it fits, it's powerful.
API Gateway + Lambda: The Classic Pattern
This pattern works well for REST APIs, but be aware of API Gateway's limitations:
- 29-second timeout
- 10MB payload limit
- Request/response size limits
I've seen teams hit these limits and have to refactor to use ALB + Lambda or move to containers.
Step Functions: Orchestration Made Easy
Step Functions are great for orchestrating Lambda functions, but they can get expensive for high-volume workflows. I once replaced a Step Functions workflow that processed millions of records daily with a simpler SQS + Lambda pattern, reducing costs by 60%.
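The cost gap comes from the pricing units: Step Functions Standard bills per state transition, while SQS and Lambda bill per request. A rough comparison for a million records a day, using widely published list prices (verify for your region; compute time is billed separately in both designs and left out, and the workflow shape is a hypothetical four-state example):

```python
# Rough orchestration-cost comparison for 1M records/day using widely
# published list prices (assumptions; verify per region): Step Functions
# Standard ~$25 per 1M state transitions, SQS ~$0.40 per 1M requests,
# Lambda ~$0.20 per 1M invocations. Compute time excluded from both.

RECORDS = 1_000_000

# Hypothetical 4-state workflow -> 4 transitions per record:
step_functions = RECORDS * 4 / 1_000_000 * 25.0

# SQS + Lambda: one send, one receive, one delete, one invocation:
sqs_lambda = RECORDS * 3 / 1_000_000 * 0.40 + RECORDS / 1_000_000 * 0.20

print(step_functions, sqs_lambda)  # daily cost, dollars
```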
DynamoDB: NoSQL Done Right
DynamoDB is powerful but has a learning curve. I've seen teams struggle with:
- Hot partitions causing throttling
- On-demand pricing being more expensive than provisioned for steady workloads
- Global tables adding complexity and cost
Start with provisioned capacity, monitor your usage patterns, and switch to on-demand only if your traffic is truly unpredictable.
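The steady-workload case is worth quantifying. Using the classic us-east-1 list prices as assumptions (on-demand around $1.25 per million write requests, provisioned around $0.00065 per WCU-hour; verify current pricing), a perfectly steady 100 writes/second looks very different under the two models:

```python
# Steady-workload DynamoDB cost comparison using the classic us-east-1
# list prices as assumptions (on-demand ~$1.25 per 1M write requests,
# provisioned ~$0.00065 per WCU-hour); verify current pricing. 730-hour
# month, writes only, hypothetical workload.

WRITES_PER_SEC = 100  # perfectly steady

on_demand = WRITES_PER_SEC * 3600 * 730 / 1_000_000 * 1.25
provisioned = WRITES_PER_SEC * 0.00065 * 730  # 100 WCUs, flat

print(round(on_demand, 2), round(provisioned, 2))  # monthly dollars
```

The ratio flips as traffic gets spikier, which is exactly why measuring your usage pattern should come before choosing a capacity mode.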
Best Practices: Lessons from Production
Start with the Well-Architected Framework
The AWS Well-Architected Framework isn't marketing—it's a practical guide based on thousands of customer architectures. I use it in every architecture review. The five pillars (operational excellence, security, reliability, performance efficiency, and cost optimization) provide a comprehensive checklist.
Use Infrastructure as Code
I've seen teams manage infrastructure manually, and it always ends badly. Use Terraform or CloudFormation from day one. I prefer Terraform because of its multi-cloud support and better state management, but CloudFormation is fine if you're AWS-only.
The key is version controlling your infrastructure code and treating it like application code—code reviews, testing, and gradual rollouts.
Implement Comprehensive Monitoring
CloudWatch is good, but it's not enough. I use a combination of:
- CloudWatch for AWS service metrics
- Application-level metrics (Prometheus, Datadog, New Relic)
- Distributed tracing (X-Ray, Jaeger)
- Log aggregation (CloudWatch Logs, ELK stack)
The goal is to have visibility into your system from multiple angles. When something breaks, you want to know immediately, and you want enough context to fix it quickly.
Automate Everything
Manual processes are error-prone and don't scale. Automate:
- Deployments (CI/CD pipelines)
- Infrastructure provisioning (Infrastructure as Code)
- Security scanning (automated vulnerability assessments)
- Cost optimization (automated right-sizing recommendations)
I've seen teams spend hours on manual deployments that could be automated in a day. The upfront investment pays off quickly.
Document Your Architecture
Documentation is often overlooked, but it's critical. I maintain:
- Architecture diagrams (using tools like Lucidchart or Draw.io)
- Runbooks for common operations
- Decision records (why we chose a particular approach)
- Incident post-mortems
Good documentation saves hours during incidents and helps onboard new team members.
Common Pitfalls and How to Avoid Them
Over-Engineering
I've seen teams build complex architectures "just in case" they need the features later. This leads to:
- Higher costs
- More complexity
- Slower development
- Harder maintenance
Start simple, add complexity only when you have a specific need.
Under-Engineering
The opposite problem—building something too simple that doesn't scale. I've seen teams build a single EC2 instance application that works fine in development but falls over in production.
Find the balance. Start with a simple architecture that can scale, then add complexity as needed.
Ignoring Costs
AWS costs can spiral out of control if you're not monitoring them. Set up billing alerts, review costs regularly, and optimize continuously. I've seen teams discover they're spending thousands of dollars on unused resources.
Not Testing Disaster Recovery
High availability and disaster recovery only work if you test them. I've seen teams with perfect disaster recovery plans that failed during actual disasters because they never tested the failover process.
Test your disaster recovery procedures regularly, and document what you learn.
Conclusion
Building on AWS is a journey, not a destination. The services evolve, best practices change, and your understanding deepens over time. Start with the fundamentals, learn from your mistakes, and continuously improve your architecture.
The most important lesson I've learned? There's no perfect architecture—only architectures that are appropriate for your specific use case, constraints, and requirements. Focus on understanding your workload, choose the right services, and iterate based on real-world feedback.
Remember: AWS is a tool, not a solution. Your architecture decisions should be driven by your application's needs, not by what's cool or trendy. Keep it simple, make it work, then optimize.