Technical Blog

Architecting for the Real World: AWS Deep Dives, Production Scaling Patterns, and the AI-Driven Cloud.

The AWS Platform Engineering Roadmap

24 posts · 8 phases

The complete core series — 24 posts across 8 phases, from networking and security through to two real-world case studies. The Generative AI journey continues separately in the Learning in Public series below.

Phase 1 · AWS Networking · 7 posts · Complete
Phase 2 · Identity & Security · 2 posts · Complete
Phase 3 · Containers & Platform · 4 posts · Complete
Phase 4 · IaC & GitOps · 2 posts · Complete
Phase 5 · Production Deep Dives · 3 posts · Complete
Phase 6 · Observability & FinOps · 2 posts · Complete
Phase 7 · Advanced Architecture · 2 posts · Complete
Phase 8 · Case Studies · 2 posts · Complete

AWS Networking Series

Seven-part series covering VPC fundamentals through enterprise-grade security architecture.

Demystifying AWS VPC: From Layman to Cloud Architect

Master the foundations of AWS networking. Learn about subnets, route tables, internet gateways (IGWs), and the famous 'Rule of 5': the five IP addresses AWS reserves in every subnet.

April 29, 2026 12 min read

Advanced VPC Concepts: Scaling & High Performance

Dive deep into Transit Gateway, PrivateLink, Global Accelerator, and cross-region peering for enterprise-scale workloads.

April 30, 2026 12 min read

NAT Gateway vs PrivateLink vs VPC Endpoints

Cost & Architecture Trade-offs. A comprehensive comparison of secure internet and service connectivity patterns in AWS.

April 30, 2026 12 min read

Transit Gateway vs VPC Peering — When to Use What

Mesh vs. Hub-and-Spoke. A deep dive into choosing the right connectivity strategy for enterprise-scale AWS environments.

May 1, 2026 25 min read

How DNS Works in AWS: Route 53 & Hybrid Failover

Mastering Route 53 Private Hosted Zones, Resolver Endpoints, and cross-account DNS resolution.

May 3, 2026 20 min read

Direct Connect vs Site-to-Site VPN vs Client VPN

Every enterprise AWS journey eventually reaches the hybrid connectivity question: how do your on-premises systems securely connect to AWS?

May 4, 2026 25 min read

AWS Network Firewall vs Security Groups vs NACLs

DevSecOps angle: Layered defence-in-depth, extending Gateway Load Balancer patterns for enterprise security.

May 6, 2026 22 min read

AWS Security Series

IAM, GuardDuty, WAF, and the defence-in-depth model for regulated AWS environments.

AWS IAM Deep Dive — Roles, Policies, and Zero-Trust at Scale

Building secure, cost-optimised infrastructure. A masterclass on OIDC, Permissions Boundaries, and Zero-Trust identity.

May 7, 2026 25 min read

AWS Security in Depth: GuardDuty, Security Hub & WAF

Beyond prevention: A deep dive into continuous monitoring, automated threat response, and application-layer protection using native AWS security services.

May 8, 2026 25 min read

AWS Containers & Platform Engineering

ECS vs EKS, production cluster design, GitOps, autoscaling, and IaC at enterprise scale.

AWS ECS Fargate vs EKS — When I Used Both and How to Choose

Navigating the container landscape. Choosing the right abstraction level for your enterprise workloads based on scale, cost, and operational overhead.

May 15, 2026 22 min read

Production-Grade EKS Architecture: Multi-Env Setup, Node Groups & Isolation Strategy

The full EKS blueprint — three-tier node groups, namespace isolation with NetworkPolicies, RBAC, Helm values hierarchy, and zero-downtime upgrade patterns.

May 16, 2026 25 min read

Ingress vs ALB vs NLB in EKS — Real Traffic Routing Patterns Explained

The Rabobank hybrid pattern — Nginx+NLB for 33 microservices, ALB for ArgoCD, NLB passthrough for Amazon MQ — with full Terraform, cost comparison, and 7 production anti-patterns.

May 17, 2026 22 min read

Karpenter vs Cluster Autoscaler — How We Run Both in Production to Maximise EKS Cost Efficiency

Three-tier compute: Cluster Autoscaler for On-Demand baseline, Karpenter for Spot. Complete Terraform, Helm, conflict prevention, interruption handling, and the real cost levers from Rabobank.

May 18, 2026 30 min read

Terraform at Scale: Structuring IaC for Enterprise AWS Environments

Layered repos, S3 remote state with DynamoDB locking, opinionated modules, plan-on-PR CI/CD with approval gates, and daily drift detection from a production EKS platform.

May 19, 2026 28 min read

GitOps with ArgoCD + Helm on EKS: App of Apps, Sync Waves & Multi-Cluster Strategy

App of Apps bootstrap, ApplicationSet generators, Sync Waves for ordered rollout, multi-cluster management, RBAC with SSO, Image Updater, and Notifications. Full working YAML from a production EKS platform.

May 20, 2026 32 min read

AWS Observability

CloudWatch, X-Ray, OpenTelemetry, and the full observability stack for production EKS platforms.

AWS Observability Stack: CloudWatch, X-Ray, OpenTelemetry & What's Still Missing

The complete observability model for EKS — Container Insights, structured logging with Fluent Bit, distributed tracing with ADOT and X-Ray, Synthetics, and the honest gaps. With the incident that made this post necessary.

May 24, 2026 20 min read

FinOps & Cost Optimisation

Cloud cost ownership from a FinOps Certified Practitioner — tagging strategy, Spot compute, Savings Plans, networking optimisation, and the operating model that makes savings stick.

AWS FinOps in Practice: Cost Optimisation Strategies from a Certified Practitioner

Real-world cost optimisation across EKS, ECS, VPC, and S3 — Karpenter Spot, Fargate Spot, Savings Plans, Gateway Endpoints, gp3 migration, and the FinOps operating model that prevents cost regression.

May 26, 2026 20 min read

Advanced Architecture Patterns

Multi-region, high availability, and global-scale architecture patterns — the decisions that determine whether your platform survives its worst day.

Multi-Region High Availability on AWS: Active-Active vs Active-Passive Design

Route 53 failover routing, Aurora Global Database, DynamoDB Global Tables, TGW peering, Global Accelerator, chaos engineering with FIS, and the decision framework for choosing between Active-Passive and Active-Active.

May 27, 2026 22 min read

Multi-Account AWS Strategy: Landing Zones, Control Tower & Org-Level Networking

AWS Organizations and OU design, Service Control Policies, Control Tower Landing Zones, Account Factory for Terraform, centralised networking with TGW + RAM, and a centralised security account model for governance at scale.

May 28, 2026 24 min read

Production Deep Dives

Standalone posts on specific production problems, incidents, and the non-obvious solutions that came from operating real systems.

Stopping EKS Test Environments: Karpenter, Step Functions & the Race Condition Fix

How we coordinated a multi-phase shutdown of EKS worker nodes and Karpenter-provisioned nodes overnight using AWS Step Functions, Lambda, and tag-based node identification.

May 21, 2026 15 min read

When GuardDuty Fires on Your Own Engineer: Investigating a False Positive in EKS

kubectl debug creates a privileged pod indistinguishable from a container escape. The Spot node was already gone when investigation started. Here is the full attribution trail — and three changes so it never takes this long again.

May 29, 2026 12 min read

From Nginx Ingress to Kubernetes Gateway API: A Production Cutover Story

Removing Nginx Ingress entirely and serving all traffic through the Gateway API on EKS — path inventory, rewrite translation to filters, external/internal Gateway split, and the war stories: a healthy route silently sending login traffic to the wrong app, OIDC, and WAF.

June 17, 2026 14 min read

Real-World Case Studies

Full production platforms and enterprise migrations, end to end — the real constraints, trade-offs, and decisions behind systems running in regulated environments. Where every building block in this series converges.

Designing a Fraud Detection Platform on AWS EKS — A 33-Microservice Architecture Case Study

A real case study: building a bank's fraud platform on EKS — device risk plus behavioural AI/ML, Databricks integration over an S3 boundary, DORA compliance, dual autoscalers, a Fargate detour, and a 40-minute production cutover at the F5.

June 9, 2026 18 min read

Migrating On-Premises Applications to AWS: The 7Rs, Real Pitfalls & a Working Playbook

A datacentre-exit war story: applying the 7Rs per workload, lift-and-shift with MGN, rebuilding Jenkins, Amazon WorkSpaces, the hidden DNS dependency, replacing a Sophos firewall with Client VPN, and the honest lessons from doing it largely single-handed.

June 11, 2026 16 min read

Learning in Public: Generative AI for Cloud Engineers

Documenting a 24-week journey through Generative AI — from an infrastructure engineer's perspective.

GenAI Learning · Week 2

Week 2: Generative AI with Large Language Models — Fine-Tuning, PEFT, LoRA & Model Evaluation

Study notes from Week 2 of the AWS & DeepLearning.AI Generative AI with LLMs course — fine-tuning approaches, LoRA, PEFT, catastrophic forgetting, ROUGE, BLEU, and model evaluation benchmarks. Written from a cloud engineer's perspective.

June 2, 2026 15 min read