Back to Portfolio
Home / Projects / Production-Ready AWS EKS Platform
Amazon EKS GitOps / ArgoCD Karpenter DORA Compliant Databricks

Production-Ready AWS EKS Platform

An event-driven, GitOps-managed Kubernetes platform built for high-throughput transaction processing — 33 microservices, sub-200ms P99 latency, DORA-compliant delivery at Rabobank.

Rabobank, Utrecht
Oct 2023 – Present
Cloud Specialist
Financial Services / Fraud
33
Microservices on EKS
<200ms
P99 Scoring Latency
100%
GitOps Deployments
2
Engineers Trained
DORA
Compliant Platform

Context & Problem Statement

Rabobank needed a new cloud-native platform to run its fraud detection software — a distributed system of 33 microservices performing real-time transaction scoring at sub-second latency. The software vendor had a clear recommendation: Amazon EKS. At the time, EKS was new territory for our team. We had deep AWS and ECS experience from previous work, but Kubernetes at this scale was a different beast entirely.

The platform had to satisfy two competing constraints simultaneously: financial-grade reliability and operational compliance. Rabobank operates under DORA regulation — every change to production infrastructure must be auditable, reversible, and documented. That constraint shaped every architectural decision we made.

Starting Point
The project started with two engineers. My colleague left after three months, leaving me as the sole architect and operator for four to five months. Everything had to be built so that one person could understand, operate, and hand off the entire platform. That constraint turned out to be one of the best design pressures I've ever worked under.

Platform Architecture

The platform runs on a single EKS cluster spanning two availability zones (eu-west-1a and eu-west-1b), with a two-tier node strategy: a Managed Node Group (On-Demand, always-on) bootstraps the cluster and hosts critical system DaemonSets, while Karpenter dynamically provisions application nodes — On-Demand for latency-sensitive scoring services, Spot for batch and ML workloads. ArgoCD drives all deployments via GitOps. Every AWS resource — the EKS cluster, VPC, IAM roles, RDS, Amazon MQ, ECR — is defined in Terraform. No manual console changes, ever.

Pods are distributed across AZs using Kubernetes topology spread constraints configured in Helm — each service runs one replica per zone, so the platform survives an AZ failure without any manual intervention. Keycloak (authentication) runs four pods — two per AZ — to handle the higher session load. Amazon MQ and RDS are each deployed as Multi-AZ pairs with automatic failover.

EKS Fraud Detection Platform — Multi-AZ Architecture (AWS + Kubernetes)

Full platform architecture — Route 53 → NLB → Nginx Ingress → EKS Multi-AZ (Karpenter) → Amazon MQ → RDS Multi-AZ → Databricks / S3 data layer

Technology Stack

Amazon EKS Karpenter ArgoCD Helm Terraform Databricks EC2 GitOps AWS IAM / IRSA Nginx Ingress Amazon MQ Amazon RDS Multi-AZ Keycloak AWS FIS Amazon CloudWatch CloudWatch EKS Addon Splunk AWS VPC k9s

Key Architecture Decisions

Why EKS Over ECS?

The fraud detection software vendor mandated Kubernetes. Our prior work at Rabobank and WCC had been primarily ECS-based, so this was a deliberate step into new territory. The vendor's software assumed Kubernetes-native primitives — custom resources, namespace isolation, pod-level networking — that don't translate cleanly to ECS.

Rather than fight the abstraction, we leaned in. We learned EKS on the job, structured everything for observability from day one, and built documentation as we went so the knowledge wouldn't live in a single person's head.

Trade-off: ECS familiarity vs EKS ecosystem
ECS would have been faster to start — we knew it well. EKS added initial complexity: IRSA for pod-level IAM, CoreDNS tuning, node lifecycle management. The payoff was access to the full Kubernetes ecosystem (Karpenter, ArgoCD, Helm) and a platform the vendor fully supported.

Terraform-First: No Console Changes

Every resource — the EKS cluster, VPC, IAM roles, node pools, Karpenter NodePools, ALB, RDS, ECR — is defined in Terraform. This was non-negotiable from the start, and it paid off repeatedly: when the solo operation period hit, the Terraform state was the single source of truth. No undocumented manual resources, no drift.

Design principle: automation as the only path
If something could be clicked in the AWS console, it had to be coded instead. This sounds like overhead; it's actually how one person can operate a 33-microservice platform without burning out.

Node Strategy: Managed Node Group + Karpenter

The cluster runs two distinct node tiers working together. A Managed Node Group (On-Demand, always-on, one node per AZ) handles cluster bootstrap and hosts system-level DaemonSets that must run on every node and cannot tolerate Spot interruptions: amazon-cloudwatch (Container Insights), the Splunk log forwarder, kube-proxy, and other EKS add-ons. These nodes are small and fixed — their purpose is to make the cluster operational, not to run application workloads.

Karpenter handles everything else. It provisions application nodes dynamically on pending pod events — in seconds, not minutes — using two NodePools:

  • On-Demand NodePool — fraud detection scoring services. Latency-sensitive, sub-200ms p99 SLA. Spot interruption risk is unacceptable here.
  • Spot NodePool — batch processors, background enrichment, and Databricks EC2 ML workers. These jobs tolerate interruption; Karpenter handles graceful migration before reclaim.

This split kept compute costs down significantly while protecting the real-time path. The Managed Node Group ensured the cluster's own observability (CloudWatch, Splunk) was never at risk from a Spot interruption.

High Availability: Pod Topology Spread

Rather than running dedicated "primary" and "standby" pod pairs, we distributed replicas across AZs using Kubernetes topology spread constraints configured in each service's Helm chart. The pattern is simple: maxSkew: 1, topologyKey: topology.kubernetes.io/zone, whenUnsatisfiable: DoNotSchedule. EKS and Karpenter enforce the spread automatically — if eu-west-1a goes down, pods in eu-west-1b continue serving with no manual intervention.

How services are distributed
  • Most services — 1 replica per AZ (2 total). Topology spread constraint in Helm ensures one pod per zone.
  • Keycloak (auth) — 2 replicas per AZ (4 total). Higher replica count handles session load; spread still enforced per zone.
  • Amazon MQ — Active broker in eu-west-1a, Standby in eu-west-1b. Automatic failover on AZ loss.
  • RDS Multi-AZ — Primary in eu-west-1a, synchronous standby in eu-west-1b. Automatic promotion on Primary failure.

Architecture Trade-offs

Every decision on this platform involved a deliberate trade-off. Here are the most significant ones:

Decision What We Chose Benefit Trade-off
Orchestrator EKS over ECS Full Kubernetes ecosystem, vendor-supported, ArgoCD native Higher initial complexity — IRSA, CoreDNS tuning, node lifecycle
Node strategy Managed Node Group + Karpenter Cost efficiency, right-sized nodes, Spot for batch workloads Two node tiers to reason about; Karpenter requires careful NodePool design
Deployment model GitOps (ArgoCD) Full audit trail, DORA compliance, rollback via Git revert Every change must go through Git — no fast hotfixes via kubectl
Compute for scoring On-Demand only for real-time path Eliminates Spot interruption risk for sub-200ms SLA Higher compute cost vs Spot
Ingress Nginx Ingress over AWS ALB Ingress Fine-grained routing control, Helm-native configuration Additional component to manage; ALB is more AWS-native
IaC Terraform-only, no console Single source of truth, reproducible, auditable Every change requires a plan/apply cycle — no quick console fixes

What Happens When Things Fail?

Reliability was not an afterthought — it was tested using AWS Fault Injection Simulator (FIS) before production sign-off. Here is what the platform does under each failure scenario:

Failure Scenario What Happens Recovery Time Tested With
AZ-a unavailable ALB stops routing to AZ-a. Pods in AZ-b continue serving. Karpenter provisions replacement nodes in AZ-b. < 30 seconds AWS FIS AZ outage experiment
Spot node interrupted Karpenter receives 2-minute warning. Pods gracefully migrated to available nodes before reclaim. On-Demand fallback if no Spot available. < 2 minutes AWS FIS Spot interruption
Pod OOM / crash loop Kubernetes restarts the pod automatically. Topology spread constraint ensures other AZ continues serving. < 10 seconds kubectl delete pod + monitoring
RDS Primary failure Multi-AZ standby promoted automatically. Connection string unchanged (CNAME). 60–120 seconds AWS FIS RDS failover
Amazon MQ broker failure Active/Standby failover. Consumers reconnect automatically via failover URL. < 30 seconds AWS FIS + manual broker stop
ArgoCD sync failure Deployment halted. Previous version continues running. Alert fired. Manual review required. No impact to running services Simulated bad Helm values push

The Hard Problem: Helm Values at Scale

Managing Helm chart configuration for 33 microservices across multiple environments — dev, staging, and production — sounds like a bookkeeping problem. It isn't. It's an architectural problem that breaks your deployment model if you get it wrong.

Early on, the naive approach was one values.yaml per service per environment. That's 99+ files before you've deployed anything to production. Changes to a shared configuration (an image registry URL, a resource limit policy) required touching dozens of files. Drift between environments was invisible until something broke in production that worked fine in staging.

What We Built Instead

Values hierarchy: base → environment → service overrides
  • Base values — shared defaults for all services (registry, resource limits, probes)
  • Environment overlay — values that differ per environment (replica counts, resource quotas, secrets refs)
  • Service override — only what's genuinely different per service (image tag, specific env vars)

ArgoCD ApplicationSets drove deployment from this structure — each application declared its own path to the relevant Helm values, while the majority of configuration was inherited from the base and environment layers. A change to the base propagated to all 33 services consistently. Environment-specific values stayed isolated. Service-specific exceptions were immediately visible as deliberate divergences, not hidden drift.

What this solved
Configuration drift between environments dropped to near zero. Onboarding a new microservice became a 10-minute task — write a service override values file with just the divergences, point ArgoCD at it, done. The first time I trained a new engineer on deployments, they were shipping independently within a day.

Databricks Integration

The fraud detection platform has two distinct processing paths: real-time transaction scoring (EKS microservices) and ML model training with batch scoring (Databricks). These run on separate infrastructure and are owned by different teams.

Databricks runs on EC2 worker nodes provisioned across multiple AZs and is owned entirely by the Data Science team — they manage their own jobs, clusters, and model training pipelines without needing Kubernetes expertise. My role was the AWS integration: VPC connectivity, IAM roles for cross-service access, S3 permissions for the data lake, and ensuring the Databricks EC2 workers were correctly placed within the VPC to reach the EKS services they depend on.

The training pipeline ingests historical transaction data from the S3 data lake, trains and validates new fraud model versions, and publishes model artifacts back to S3. The real-time scoring microservices on EKS load model versions from S3 at startup and serve inference on live transactions — targeting the sub-200ms p99 latency requirement.

Architecture split: train in Databricks, serve from EKS
Training is compute-intensive with high latency tolerance (minutes to hours per job). Inference is latency-critical and runs at transaction volume (thousands of requests per second). Splitting them lets each system be right-sized independently — Databricks EC2 workers on Spot for cost efficiency, EKS scoring pods on On-Demand for reliability. The Data Science team gets full autonomy over their tooling without touching Kubernetes.

DORA Compliance Through GitOps

DORA regulation (Digital Operational Resilience Act) requires financial institutions to demonstrate that changes to production systems are controlled, traceable, and reversible. This isn't optional at Rabobank. The audit requirements shaped every layer of the delivery pipeline.

DORA Requirement How We Satisfied It
Change audit trail Every deployment is a Git commit — author, timestamp, what changed, and a PR review before merge. ArgoCD records the sync state for every deployment.
Rollback capability GitOps means rollback is a Git revert. ArgoCD can roll back to any previous sync state in seconds. No manual steps, no console access required.
Change approval Pull request required for any change to infrastructure or application config. ArgoCD only syncs what Git says — no direct kubectl apply to production.
Environment segregation Namespace isolation in EKS. Separate Terraform workspaces per environment. ArgoCD projects scoped by environment with RBAC.
Resilience testing AWS Fault Injection Simulator (FIS) was used to run structured disaster recovery scenarios as required by DORA — AZ failover, pod termination, node drain, RDS primary failover, and Amazon MQ broker failover. Each scenario was run in staging first, with documented pass/fail evidence, before sign-off for production.

The key insight is that GitOps and DORA are naturally aligned. GitOps gives you the audit trail, the approval gates, and the rollback mechanism that compliance requires — not as bolt-on process, but as the default behaviour of the system.

Lessons Learned

1
Learning a new technology in production forces depth
EKS was new when we started. There was no time for a sandbox phase — we built the production platform as we learned. That pressure is uncomfortable, but it produces a different quality of understanding than coursework. When you debug a Karpenter scheduling failure at 11pm with no one to ask, you learn the system properly.
2
Sole ownership clarifies priorities fast
Operating solo for four months meant every decision had to pass a single test: "can I understand, debug, and explain this at 2am?" Anything unnecessarily complex got simplified. The platform that came out of that period was more operable than if it had been built by a full team from the start, precisely because there was no one else to absorb the complexity.
The Incident That Shaped Our Observability Strategy

Early in the platform's life, a newly onboarded microservice entered a crash loop due to a misconfiguration introduced during its initial setup. The pod was failing — but no alarm fired. No alert reached the team. The service sat degraded in silence until someone noticed it manually.

The root cause was straightforward in hindsight: the CloudWatch log group for that service had not been added to the alarm configuration. The observability stack was monitoring the log groups that existed at the time it was built — and this service was added after. The monitoring had not been extended to cover it.

The pivot: We rebuilt the alarm strategy with three changes.

First, pod-level failure detection moved from log-based alarms to Container Insights metric alarms on pod_number_of_container_restarts — a metric automatically available for every pod in the cluster without any log group configuration. No new service can be invisible to this alarm.

Second, all alarm notifications were routed to multiple channels simultaneously — CloudWatch → SNS → Teams #platform-alerts. A single notification channel is a single point of failure in your incident response.

Third, any alarm firing in production automatically raises a Change Request in the ITSM tool. The incident is not just a notification — it is a tracked work item with an owner, a timeline, and a resolution record. Every production event is handled as a formal CR, ensuring nothing is resolved off the books and audit trails are complete.

The service was down for longer than it should have been. The platform was better for it.

3
k9s changed how I debug Kubernetes
Early Kubernetes troubleshooting via raw kubectl was slow — switching between describe, logs, exec, and event streams across namespaces takes time when you're hunting a live incident. k9s's interactive TUI collapsed that workflow dramatically. Watching pod restarts, log streaming, and event correlation in a single view is a meaningful productivity difference in a 33-service cluster.
4
Documentation written during solo operation is the best documentation
When I eventually onboarded two new engineers, the runbooks I had written for myself — during the solo months, under pressure — were the training material. They were accurate because I had written them for a real audience: future me, who might be tired and under pressure. Both engineers were deploying independently within a week of joining.
5
Helm values structure is an architecture decision, not a config task
How you organise Helm values across environments determines your operational toil for the life of the platform. Getting the base/environment/service-override hierarchy right early saved hundreds of hours of configuration drift debugging later. Treat it like you'd treat your Terraform module structure — design it upfront, refactor it deliberately.

What I Would Do Differently

If I were starting this platform from scratch today, these are the decisions I would make differently — not because the original decisions were wrong, but because experience reveals what matters earlier than you expect.

1
Define the namespace and RBAC model before any team deploys
The namespace structure evolved organically in the first few weeks — a natural consequence of building a platform while other teams are actively trying to use it. The cost of retrofitting namespace isolation, ResourceQuotas, and RBAC boundaries onto a running cluster is significantly higher than designing those boundaries upfront. Even a rough RBAC model written on a whiteboard before the cluster is provisioned saves weeks of careful migration work later. Define who owns which namespace, what quotas apply, and what permissions each team has before the first workload is deployed — not after.
2
Start with the two-cluster model (dev + prod) from the beginning
The initial architecture used a single cluster with namespace isolation between environments. It worked — until it didn't. A noisy dev workload affecting staging, a cluster upgrade that impacted all environments simultaneously, and audit questions about prod/non-prod separation eventually drove a migration to dedicated clusters. The migration with live workloads was non-trivial. Starting with separate clusters from Day 1 costs one extra control plane charge (~$73/month) and saves a significant amount of future migration work. In a regulated environment, the prod/non-prod separation question will always be asked. Answer it in the architecture, not in a meeting.
3
Invest in Helm value hierarchy design before the first chart
The three-layer Helm values pattern — base, environment override, service override — was not in place from the start. Early services had environment-specific values scattered with no consistent pattern. Standardising this retroactively required touching every chart in the repository simultaneously. The standard should have been defined and documented before the first chart was written, not extracted from 10 charts that had already diverged from each other.
4
Plan ACM certificate coverage before the first service goes external
We used ACM certificate ARNs attached at the NLB for both internal and external Nginx Ingress — the right pattern. What we did not do upfront was provision wildcard certificates for all internal domain patterns before services started needing external access. Individual certificates were requested per service as each one needed to be exposed, creating a manual process that produced gaps when a service needed to be reachable quickly. Provisioning wildcard ACM certificates for all expected domain patterns on Day 1 and referencing them consistently via the NLB annotation across all Ingress resources is the correct starting point. One certificate, one annotation value, consistent from the start.
5
Test Spot interruption handling explicitly before enabling Spot
Spot was enabled for batch workloads with the assumption that applications handled SIGTERM gracefully. Most did. A small number did not — they had no preStop hook and no graceful shutdown logic. The first Spot interruption event on those services caused dropped connections that were difficult to reproduce in a test environment. Interruption handling should be a mandatory checklist item verified for every service before Spot is enabled in production, not an assumption validated by a live interruption event.
6
Document the "why" in real time, not retrospectively
The Terraform code and Helm charts are well-documented. The architectural decision records — why a specific networking pattern was chosen, why the scoring service runs On-Demand while enrichment runs Spot, why Nginx Ingress was chosen over the ALB Ingress Controller — were written months after the decisions were made. Memory is imperfect. Decisions that seemed obvious at the time required reconstruction. Writing a one-paragraph decision record at the moment of the decision costs ten minutes. Reconstructing it six months later costs an hour and is never quite accurate.

Outcomes & Impact

33
Microservices in Production
<200ms
P99 Scoring Latency
2
Engineers Trained on K8s + GitOps
100%
IaC Coverage (Terraform)
DORA
Regulation Compliant