GitOps with ArgoCD + Helm on EKS: App of Apps, Sync Waves & Multi-Cluster Strategy

May 20, 2026 • 32 min read GitOps Kubernetes

AWS Series | Part 15 — Building secure, cost-optimised, cloud-native infrastructure on AWS.

[Figure: GitOps with ArgoCD + Helm on EKS architecture]

TL;DR

Concept	What It Solves	When You Need It
ArgoCD Application	Deploys one service from Git	Every service
App of Apps	Bootstraps all Applications from one root App	Day 1 — before you have more than 3 services
ApplicationSet	Generates Applications dynamically — one template, many targets	Multi-env, multi-cluster, multi-service
Sync Waves	Orders resource creation within a sync — databases before apps	Any service with dependencies
Sync Hooks	Runs jobs at specific points — migrations, smoke tests	Database migrations, post-deploy validation
ArgoCD Projects	Scopes which repos/clusters/namespaces an Application can touch	Multi-team, RBAC, compliance
Multi-cluster	Manages applications across prod + dev clusters from one ArgoCD	Two+ clusters in your organisation
Notifications	Alerts on sync success/failure/degraded	Production — always

Introduction — IaC Builds the Platform, GitOps Runs the Deployments

Blog 14 covered Terraform at scale — how to structure IaC that provisions the EKS cluster, VPC, IAM roles, and Karpenter. This post covers what runs on top of that infrastructure: the GitOps layer that deploys and manages your applications.

The distinction matters. Terraform is state-reconciliation — it runs when you tell it to. GitOps is continuous reconciliation — it runs forever, comparing your cluster's actual state to Git's desired state and correcting any drift automatically. Terraform builds the stage. ArgoCD manages what performs on it.
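To make "continuous reconciliation" concrete, here is a minimal Python sketch of the comparison a GitOps controller repeats on every loop. It is a toy model, not ArgoCD's actual diff engine: the function name plan_reconcile and the flat name-to-spec dictionaries are assumptions made for illustration.

```python
def plan_reconcile(desired: dict, actual: dict) -> dict:
    """Compare desired state (Git) with actual state (cluster), keyed by resource name."""
    return {
        # In Git but missing from the cluster
        "create": sorted(k for k in desired if k not in actual),
        # Present in both, but the cluster has drifted from Git
        "update": sorted(k for k in desired if k in actual and desired[k] != actual[k]),
        # In the cluster but no longer in Git (what prune would remove)
        "delete": sorted(k for k in actual if k not in desired),
    }


desired = {"orders": {"image": "v1.4.2"}, "payments": {"image": "v2.0.0"}}
actual = {"orders": {"image": "v1.4.1"}, "legacy": {"image": "v0.9.0"}}
plan = plan_reconcile(desired, actual)
print(plan)
```

Terraform computes a plan like this only when someone runs it. A GitOps controller recomputes it on a timer and acts on the result.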

On the production EKS platform managing 33 microservices, ArgoCD is the single deployment mechanism for everything running in the cluster — applications, Helm releases, Karpenter NodePools, monitoring stack, cert configurations. Nothing is deployed via kubectl apply in production. Nothing is deployed via a Helm CLI command by hand. Every change is a Git commit. Every rollback is a Git revert. Every deployment has an author, a timestamp, and a diff.

This post covers the full ArgoCD production setup: App of Apps bootstrap, ApplicationSets for multi-environment deployments, Sync Waves for ordered rollout, multi-cluster management, RBAC with SSO, and notifications. Everything has working YAML.

1. ArgoCD Installation — Production-Grade Helm Setup

Why Helm for ArgoCD, Not the Raw Manifest

The official ArgoCD install.yaml is fine for getting started. For production, it gives you no control over resource requests, replica counts, HA configuration, or SSO integration. The Helm chart exposes all of these as values.

resource "helm_release" "argocd" {
  name             = "argocd"
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"
  version          = "7.3.11"
  namespace        = "argocd"
  create_namespace = true

  # Wait for all pods to be ready before Terraform marks the release healthy
  wait    = true
  timeout = 600

  values = [yamlencode({
    # All chart-wide settings live in ONE global block. A repeated key in an HCL
    # object does not merge: the later value replaces the earlier one, so a second
    # global block further down would silently drop the image pin.
    global = {
      # Image tag: pin to a specific version, not latest
      image = { tag = "v2.11.3" }

      # Run ArgoCD on the system node group, not on Karpenter-managed nodes
      tolerations = [{
        key      = "node-role"
        value    = "system"
        operator = "Equal"
        effect   = "NoSchedule"
      }]
      affinity = {
        nodeAffinity = {
          requiredDuringSchedulingIgnoredDuringExecution = {
            nodeSelectorTerms = [{
              matchExpressions = [{
                key      = "node-role"
                operator = "In"
                values   = ["system"]
              }]
            }]
          }
        }
      }
    }

    configs = {
      params = {
        # Keep TLS enabled on argocd-server (needed for ssl-passthrough below)
        "server.insecure" = false
        # Compress API and UI responses with gzip
        "server.enable.gzip" = true
      }

      cm = {
        # Disable local admin user: SSO only in production
        "admin.enabled" = "false"

        # SSO via Okta/Azure AD
        "oidc.config" = yamlencode({
          name         = "Okta"
          issuer       = var.okta_issuer_url
          clientID     = var.okta_client_id
          clientSecret = "$oidc.okta.clientSecret"
          requestedScopes = ["openid", "profile", "email", "groups"]
          requestedIDTokenClaims = {
            groups = { essential = true }
          }
        })

        # Resource tracking: annotation-based. ArgoCD 2.x defaults to label
        # tracking, so this has to be set explicitly.
        "application.resourceTrackingMethod" = "annotation"

        # Status badge: allows embedding deployment status in GitHub PRs
        "statusbadge.enabled" = "true"

        # How often ArgoCD polls Git and re-compares each Application
        # (a polling interval, not a health-check timeout)
        "timeout.reconciliation" = "180s"
      }

      # RBAC: maps SSO groups to ArgoCD roles
      rbac = {
        "policy.default" = "role:readonly"
        "policy.csv" = <<-EOT
          # Platform team: full admin
          g, platform-engineers, role:admin

          # Dev teams: can sync their own apps, cannot modify cluster-scoped resources
          # (role:payments-deploy needs its own p rules, analogous to role:orders-deploy)
          g, orders-team, role:orders-deploy
          g, payments-team, role:payments-deploy

          # Custom role: applications in the "orders" project only
          # (the object field has the form <project>/<application>)
          p, role:orders-deploy, applications, get, orders/*, allow
          p, role:orders-deploy, applications, sync, orders/*, allow
          p, role:orders-deploy, applications, action/*, orders/*, allow
          p, role:orders-deploy, logs, get, orders/*, allow

          # Read-only for architects and auditors
          g, architects, role:readonly
        EOT
      }
    }

    # ArgoCD server: HA with 2 replicas
    server = {
      replicas = 2

      # Ingress via the Nginx Ingress Controller
      # NOTE: argo-cd chart 6.x and later reworked the ingress values
      # (hostname, extraTls). Check these keys against the pinned chart version.
      ingress = {
        enabled          = true
        ingressClassName = "nginx"
        annotations = {
          "nginx.ingress.kubernetes.io/ssl-passthrough"    = "true"
          "nginx.ingress.kubernetes.io/force-ssl-redirect" = "true"
        }
        hosts = ["argocd.internal.company.com"]
        tls = [{
          secretName = "argocd-tls"
          hosts      = ["argocd.internal.company.com"]
        }]
      }

      resources = {
        requests = { cpu = "100m", memory = "256Mi" }
        limits   = { cpu = "1000m", memory = "1Gi" }
      }
    }

    # Application controller: runs reconciliation. Replicas shard the managed
    # clusters between them; they are not active/standby copies of each other.
    controller = {
      replicas = 2

      resources = {
        requests = { cpu = "250m", memory = "512Mi" }
        limits   = { cpu = "2000m", memory = "2Gi" }
      }
    }

    # Repo server: clones and templates Git repos
    repoServer = {
      replicas = 2

      resources = {
        requests = { cpu = "100m", memory = "256Mi" }
        limits   = { cpu = "1000m", memory = "1Gi" }
      }
    }

    # Redis: ArgoCD's cache (rebuildable, not the source of truth), HA mode
    redis-ha = {
      enabled = true   # HA Redis for production: 3 replicas
    }
  })]
}

Why ArgoCD must run on the system node group: If ArgoCD runs on a Karpenter-managed node and Karpenter consolidates that node, ArgoCD goes down. With ArgoCD down, no deployments can happen, including the deployment that would fix the problem. Pin it permanently to the system managed node group (MNG).

2. App of Apps — The Bootstrap Pattern

The Problem It Solves

When you have 33 microservices, you need 33 ArgoCD Application objects. You could apply them manually with kubectl — but then those Application objects themselves are unmanaged, untracked, and can drift from Git. The App of Apps pattern solves this: one root Application manages all other Application objects, which in turn manage your services.

Root App (managed by humans — applied once)
    ↓
Apps of Apps (managed by Root App)
    ├── platform-apps (manages platform-level Applications)
    │   ├── ArgoCD Notifications App
    │   ├── Karpenter App
    │   ├── Ingress Nginx App
    │   └── cert-manager App
    └── service-apps (manages service-level Applications)
        ├── orders-service App
        ├── payment-processing-service App
        ├── enrichment-service App
        └── ... (30 more)

The Root Application — Applied Once, Manages Everything

# root-app.yaml — applied manually ONCE during cluster bootstrap
# After this, everything is managed by ArgoCD
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
  # Cascade delete: with this finalizer, deleting the root app also deletes all child apps.
  # It stays commented out in production to prevent accidental mass deletion.
  # finalizers:
  #   - resources-finalizer.argocd.argoproj.io
spec:
  project: default

  source:
    repoURL: https://github.com/org/gitops-config
    targetRevision: main
    path: bootstrap/root   # Points to the apps-of-apps directory

  destination:
    server: https://kubernetes.default.svc
    namespace: argocd

  syncPolicy:
    automated:
      prune: false    # Never auto-delete Application objects — too risky
      selfHeal: true  # Revert manual changes to Application objects
    syncOptions:
      - CreateNamespace=true

The Apps of Apps Directory Structure

gitops-config/
├── bootstrap/
│   └── root/                          # Root app points here
│       ├── platform-apps.yaml         # Application for platform tooling
│       └── service-apps.yaml          # ApplicationSet for all services
│
├── apps/
│   ├── platform/                      # Platform-level Helm releases
│   │   ├── karpenter/
│   │   │   ├── Chart.yaml
│   │   │   └── values.yaml
│   │   ├── ingress-nginx/
│   │   └── argocd-notifications/
│   │
│   └── services/                      # Service Helm charts
│       ├── orders/
│       │   ├── Chart.yaml
│       │   ├── values.yaml
│       │   ├── values-dev.yaml
│       │   └── values-prod.yaml
│       └── payment-processing/
│
└── projects/                          # ArgoCD Project definitions
    ├── orders-project.yaml
    └── payments-project.yaml

Platform Apps — Managing Cluster Tooling via ArgoCD

# bootstrap/root/platform-apps.yaml
# ArgoCD manages its own add-ons — including Karpenter, Ingress, monitoring
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-apps
  namespace: argocd
spec:
  project: platform

  source:
    repoURL: https://github.com/org/gitops-config
    targetRevision: main
    path: apps/platform

  destination:
    server: https://kubernetes.default.svc
    namespace: argocd

  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

3. ApplicationSet — One Template, All Environments

ApplicationSet is the most powerful ArgoCD feature for multi-environment, multi-cluster platforms. Instead of writing one Application per service per environment, you write one template and ArgoCD generates all the Applications automatically.

Generator Types — Choosing the Right One

Generator	What It Does	Best For
List	Explicit list of environments/clusters	Fixed, known set of targets
Git	Discovers targets from directory structure in Git	Dynamic — new directory = new app
Matrix	Combines two generators (e.g., services × environments)	Large service counts × environment counts
Cluster	Generates one app per registered ArgoCD cluster	Multi-cluster deployments
SCM Provider	Discovers repos from GitHub/GitLab org	Large organisations with many repos

Pattern 1 — List Generator: Multi-Environment Service Deployment

# bootstrap/root/service-apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: orders-service
  namespace: argocd
spec:
  # Go templating is required for the templatePatch at the bottom
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]

  syncPolicy:
    # Keep deployed resources if this ApplicationSet is deleted by accident
    preserveResourcesOnDeletion: true

  generators:
    - list:
        elements:
          - env:         dev
            cluster:     https://dev-cluster.example.com
            namespace:   orders-dev
            valuesFile:  values-dev.yaml
            autoSync:    true
            prune:       true

          - env:         staging
            cluster:     https://dev-cluster.example.com   # Shares dev cluster
            namespace:   orders-staging
            valuesFile:  values-staging.yaml
            autoSync:    true
            prune:       true

          - env:         prod
            cluster:     https://prod-cluster.example.com
            namespace:   orders
            valuesFile:  values-prod.yaml
            autoSync:    false   # Manual sync in production
            prune:       false   # Never auto-prune prod resources

  template:
    metadata:
      name: "orders-{{ .env }}"
      namespace: argocd
      labels:
        environment: "{{ .env }}"
        app: orders
    spec:
      project: "orders-{{ .env }}"

      source:
        repoURL:        https://github.com/org/gitops-config
        targetRevision: main
        path:           apps/services/orders
        helm:
          valueFiles:
            - values.yaml              # Base: always loaded first
            - "{{ .valuesFile }}"      # Environment-specific override

      destination:
        server:    "{{ .cluster }}"
        namespace: "{{ .namespace }}"

      syncPolicy:
        syncOptions:
          - CreateNamespace=true
          - ServerSideApply=true
          - PrunePropagationPolicy=foreground
          - RespectIgnoreDifferences=true   # Do not overwrite ignored fields on sync
        retry:
          limit: 3
          backoff:
            duration:    5s
            factor:      2
            maxDuration: 3m

      # Ignore differences that are managed externally
      ignoreDifferences:
        - group: apps
          kind:  Deployment
          jsonPointers:
            - /spec/replicas   # HPA manages replicas, so ArgoCD must not revert HPA scaling

  # Boolean fields cannot be filled from string parameters, and the mere presence
  # of an automated block enables auto-sync. templatePatch (ArgoCD 2.10+) adds the
  # block only for environments where autoSync is true; prod gets an empty patch.
  templatePatch: |
    {{- if .autoSync }}
    spec:
      syncPolicy:
        automated:
          prune: {{ .prune }}
          selfHeal: true
    {{- else }}
    {}
    {{- end }}

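Critical: ignoreDifferences on /spec/replicas is essential when using HPA. Without it, ArgoCD reports the Deployment as OutOfSync whenever HPA scales it, and with selfHeal enabled it reverts the replica count to whatever is in your Helm values. The result is a fight between ArgoCD and HPA and constant pod churn. ignoreDifferences only affects the diff, so also set the RespectIgnoreDifferences=true sync option (or leave replicas out of the chart); otherwise every sync still applies the replica count from Git.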
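The effect of an ignored JSON pointer on the diff can be sketched in a few lines of Python. This is a simplified model under stated assumptions: drop_pointer and in_sync are hypothetical helpers, manifests are plain nested dicts, and only dictionary paths are handled (no list indices or pointer escaping).

```python
import copy


def drop_pointer(doc: dict, pointer: str) -> dict:
    """Return a copy of doc with the value at a JSON pointer removed (dict paths only)."""
    out = copy.deepcopy(doc)
    *parents, leaf = pointer.strip("/").split("/")
    node = out
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return out  # Path does not exist: nothing to drop
    node.pop(leaf, None)
    return out


def in_sync(desired: dict, live: dict, ignored=()) -> bool:
    """Compare two manifests after removing every ignored pointer from both."""
    for pointer in ignored:
        desired = drop_pointer(desired, pointer)
        live = drop_pointer(live, pointer)
    return desired == live


desired = {"spec": {"replicas": 2, "paused": False}}
live = {"spec": {"replicas": 7, "paused": False}}  # HPA scaled up to 7

print(in_sync(desired, live))                      # replicas differ
print(in_sync(desired, live, ["/spec/replicas"]))  # replicas ignored
```

With the pointer ignored, the scaled Deployment compares as equal to Git, so there is nothing for selfHeal to revert.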

Pattern 2 — Git Generator: Auto-Discover Services from Directory Structure

For large platforms where new services are added frequently, the Git generator eliminates the need to update the ApplicationSet every time a new service appears:

# One ApplicationSet discovers ALL services automatically
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: all-services-prod
  namespace: argocd
spec:
  generators:
    - git:
        repoURL:  https://github.com/org/gitops-config
        revision: main
        directories:
          - path: apps/services/*   # Every subdirectory = one Application

  template:
    metadata:
      name: "{{path.basename}}-prod"   # Directory name becomes app name
      namespace: argocd
    spec:
      project: services-prod

      source:
        repoURL:        https://github.com/org/gitops-config
        targetRevision: main
        path:           "{{path}}"   # Directory path is used as chart path
        helm:
          valueFiles:
            - values.yaml
            - values-prod.yaml

      destination:
        server:    https://prod-cluster.example.com
        namespace: "{{path.basename}}"   # Service name becomes namespace name

      syncPolicy:
        automated:
          prune:    false
          selfHeal: true

How adding a new service works:

1. Create apps/services/new-scoring-service/ with Chart.yaml, values.yaml and values-prod.yaml (the template references both values files)
2. Commit and push to main
3. ArgoCD detects the new directory within ~3 minutes
4. A new Application new-scoring-service-prod is automatically created
5. ArgoCD syncs and deploys, with no manual Application creation needed
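The discovery step is essentially a mapping from directory paths to Application parameters. The Python sketch below mimics it; apps_from_directories is a hypothetical helper, and the real generator reads the paths from the Git repository instead of taking a list.

```python
from pathlib import PurePosixPath


def apps_from_directories(paths, suffix="prod"):
    """Derive one Application description per service directory."""
    apps = []
    for path in sorted(paths):
        basename = PurePosixPath(path).name  # Plays the role of path.basename
        apps.append({
            "name": f"{basename}-{suffix}",
            "path": path,
            "namespace": basename,
        })
    return apps


discovered = apps_from_directories([
    "apps/services/orders",
    "apps/services/payment-processing",
    "apps/services/new-scoring-service",
])
for app in discovered:
    print(app["name"], "->", app["namespace"])
```

Adding a directory to the input list is the only change needed to get another Application out, which is the whole appeal of the Git generator.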

Pattern 3 — Matrix Generator: All Services × All Environments

# The most powerful pattern — N services × M environments = N×M Applications
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: all-services-all-envs
  namespace: argocd
spec:
  # Go templating is required for the templatePatch at the bottom
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]

  generators:
    - matrix:
        generators:
          # First dimension: discover services from Git
          - git:
              repoURL:  https://github.com/org/gitops-config
              revision: main
              directories:
                - path: apps/services/*

          # Second dimension: fixed list of environments
          - list:
              elements:
                - env:       dev
                  cluster:   https://dev-cluster.example.com
                  autoSync:  true
                - env:       prod
                  cluster:   https://prod-cluster.example.com
                  autoSync:  false

  template:
    metadata:
      name: "{{ .path.basename }}-{{ .env }}"
    spec:
      project: "services-{{ .env }}"
      source:
        repoURL:        https://github.com/org/gitops-config
        targetRevision: main
        path:           "{{ .path.path }}"
        helm:
          valueFiles:
            - values.yaml
            - "values-{{ .env }}.yaml"
      destination:
        server:    "{{ .cluster }}"
        namespace: "{{ .path.basename }}"

  # Add the automated block only where autoSync is true. Its mere presence
  # enables auto-sync, and boolean fields cannot be filled from string parameters.
  templatePatch: |
    {{- if .autoSync }}
    spec:
      syncPolicy:
        automated:
          selfHeal: true
          prune:    true
    {{- else }}
    {}
    {{- end }}
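The Matrix generator is a Cartesian product of its two child generators. A short Python sketch shows the expansion; the service names and the matrix function are illustrative assumptions, and the real generator merges full parameter sets and does not stop at names.

```python
from itertools import product

services = ["orders", "payment-processing", "enrichment"]
environments = [
    {"env": "dev", "autoSync": True},
    {"env": "prod", "autoSync": False},
]


def matrix(services, environments):
    """Combine every service with every environment (N x M results)."""
    return [
        {
            "name": f"{service}-{e['env']}",
            "project": f"services-{e['env']}",
            "autoSync": e["autoSync"],
        }
        for service, e in product(services, environments)
    ]


generated = matrix(services, environments)
print(len(generated))  # 3 services x 2 environments
for app in generated:
    print(app["name"], app["autoSync"])
```

The count grows multiplicatively, so it is worth checking the generated list in dev before pointing a Matrix ApplicationSet at production.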

4. ArgoCD Projects — Scoping Access and Reducing Blast Radius

An ArgoCD Project defines boundaries — which source repos an Application can pull from, which destination clusters and namespaces it can deploy to, and which Kubernetes resources it can manage. Without Projects, every Application can theoretically deploy anywhere.
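Conceptually, a Project is an allow-list check that runs before an Application is accepted. The Python sketch below is a rough model only: is_permitted and the dictionary layout are assumptions, and ArgoCD's own glob matching has more rules than fnmatchcase.

```python
from fnmatch import fnmatchcase


def is_permitted(project: dict, repo: str, server: str, namespace: str) -> bool:
    """Check an Application's source and destination against a project's allow-lists."""
    repo_ok = any(fnmatchcase(repo, pattern) for pattern in project["sourceRepos"])
    dest_ok = any(
        fnmatchcase(server, d["server"]) and fnmatchcase(namespace, d["namespace"])
        for d in project["destinations"]
    )
    return repo_ok and dest_ok


orders_prod = {
    "sourceRepos": ["https://github.com/org/gitops-config"],
    "destinations": [
        {"server": "https://prod-cluster.example.com", "namespace": "orders"},
    ],
}

allowed = is_permitted(
    orders_prod,
    "https://github.com/org/gitops-config",
    "https://prod-cluster.example.com",
    "orders",
)
blocked = is_permitted(
    orders_prod,
    "https://github.com/org/gitops-config",
    "https://prod-cluster.example.com",
    "payments",  # Namespace outside the project's destinations
)
print(allowed, blocked)
```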

# projects/orders-project.yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: orders-prod
  namespace: argocd
spec:
  description: "Order processing services — production cluster"

  # Only pull from the approved GitOps repo
  sourceRepos:
    - https://github.com/org/gitops-config
    - https://github.com/org/orders-service

  # Only deploy to the orders namespace in prod
  destinations:
    - server:    https://prod-cluster.example.com
      namespace: orders

  # Cluster-scoped resources this project CANNOT manage
  clusterResourceBlacklist:
    - group: ""
      kind:  Node
    - group: rbac.authorization.k8s.io
      kind:  ClusterRole
    - group: rbac.authorization.k8s.io
      kind:  ClusterRoleBinding

  # Namespace-scoped resources this project CAN manage
  namespaceResourceWhitelist:
    - group: apps
      kind:  Deployment
    - group: apps
      kind:  StatefulSet
    - group: ""
      kind:  Service
    - group: ""
      kind:  ConfigMap
    - group: autoscaling
      kind:  HorizontalPodAutoscaler
    - group: networking.k8s.io
      kind:  Ingress
    - group: policy
      kind:  PodDisruptionBudget

  # RBAC — which ArgoCD roles can do what in this project
  roles:
    - name: orders-deployer
      description: "Orders team — can sync and view, cannot delete"
      policies:
        # Objects in project role policies take the form <project>/<application>
        - p, proj:orders-prod:orders-deployer, applications, get, orders-prod/*, allow
        - p, proj:orders-prod:orders-deployer, applications, sync, orders-prod/*, allow
        - p, proj:orders-prod:orders-deployer, applications, action/*, orders-prod/*, allow
        - p, proj:orders-prod:orders-deployer, logs, get, orders-prod/*, allow
      groups:
        - orders-team

    - name: platform-admin
      description: "Platform team: full access"
      policies:
        - p, proj:orders-prod:platform-admin, applications, *, orders-prod/*, allow
      groups:
        - platform-engineers

  # Sync windows: prevent deployments during business-critical hours
  syncWindows:
    - kind:         deny
      schedule:     "0 8 * * MON-FRI"   # Block syncs 8am to 10am on weekdays (UTC unless timeZone is set)
      duration:     2h                   # Peak transaction processing window
      applications:
        - "*"
      namespaces:
        - orders
      manualSync: false   # false (the default) also blocks manual syncs; true would allow them

Why sync windows matter in production: A deployment at 9am on a Monday during peak transaction load is the highest-risk time to introduce a change. Sync windows are the ArgoCD-native way to enforce deployment freeze periods without relying on human discipline. They are configured in Git, visible to the whole team, and enforced automatically.
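To sanity-check a window before committing it, the schedule can be modelled in a few lines of Python. This sketch hard-codes the weekday 08:00 plus two hours window instead of parsing cron, uses naive datetimes in whatever time zone the window is configured for, and does not handle windows that cross midnight; in_deny_window is a hypothetical helper.

```python
from datetime import datetime, timedelta


def in_deny_window(now: datetime, start_hour: int = 8,
                   duration: timedelta = timedelta(hours=2)) -> bool:
    """True if `now` falls inside a window that opens at start_hour, Monday to Friday."""
    start = now.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and start <= now < start + duration


monday_peak = datetime(2026, 5, 18, 9, 0)     # Monday 09:00
monday_after = datetime(2026, 5, 18, 10, 0)   # Monday 10:00, window just closed
saturday_peak = datetime(2026, 5, 23, 9, 0)   # Saturday 09:00

print(in_deny_window(monday_peak), in_deny_window(monday_after), in_deny_window(saturday_peak))
```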

5. Sync Waves — Ordered Deployment Within a Sync

Sync Waves solve the dependency ordering problem: a database schema migration must complete before the application that depends on the new schema starts. A ConfigMap must exist before the Deployment that mounts it.

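Without Sync Waves, every resource sits in wave 0: ArgoCD applies them in one pass, ordered only by resource kind and name, and does not wait for one resource to become healthy before applying the next. That works for independent resources but breaks dependent ones, often without an obvious error.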

How Waves Work

Resources are applied in ascending wave order. ArgoCD waits for all resources in wave N to be healthy before starting wave N+1.

Wave -1: Namespace, ResourceQuota (must exist before anything else)
Wave 0:  ConfigMaps, Secrets references (default — no annotation needed)
Wave 1:  Database migration Job (must complete before app starts)
Wave 2:  Service, ServiceAccount, PodDisruptionBudget
Wave 3:  Deployment (starts after migration is complete)
Wave 4:  HorizontalPodAutoscaler, Ingress (after pods are running)
Wave 5:  Smoke test Job (validates deployment succeeded)
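The ordering rule itself is small enough to sketch in Python: read the wave annotation (default 0), sort, and group. This is an illustration under assumptions (wave_of and plan_waves are made-up names) and it leaves out the health wait between waves and the separate hook phases.

```python
from itertools import groupby

WAVE_ANNOTATION = "argocd.argoproj.io/sync-wave"


def wave_of(resource: dict) -> int:
    """Read the sync-wave annotation; resources without one are in wave 0."""
    annotations = resource.get("metadata", {}).get("annotations", {})
    return int(annotations.get(WAVE_ANNOTATION, "0"))


def plan_waves(resources):
    """Group resource names by wave, in ascending wave order."""
    ordered = sorted(resources, key=wave_of)  # Stable sort keeps input order within a wave
    return [
        (wave, [r["metadata"]["name"] for r in group])
        for wave, group in groupby(ordered, key=wave_of)
    ]


resources = [
    {"metadata": {"name": "order-processor", "annotations": {WAVE_ANNOTATION: "3"}}},
    {"metadata": {"name": "orders", "annotations": {WAVE_ANNOTATION: "-1"}}},
    {"metadata": {"name": "orders-config"}},  # No annotation: wave 0
    {"metadata": {"name": "orders-db-migration", "annotations": {WAVE_ANNOTATION: "1"}}},
]
waves = plan_waves(resources)
for wave, names in waves:
    print(wave, names)
```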

# Namespace — applied first, before any other resource
apiVersion: v1
kind: Namespace
metadata:
  name: orders
  annotations:
    argocd.argoproj.io/sync-wave: "-1"

---
# Database migration Job — must complete before app Deployment
apiVersion: batch/v1
kind: Job
metadata:
  name: orders-db-migration
  namespace: orders
  annotations:
    argocd.argoproj.io/sync-wave: "1"
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 300   # Fail if migration takes more than 5 minutes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migration
          image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/orders-db-migrator:latest
          command: ["python", "manage.py", "migrate", "--no-input"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: orders-db-credentials
                  key:  url

---
# Service — wave 2, before Deployment
apiVersion: v1
kind: Service
metadata:
  name: order-processor-service
  namespace: orders
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  selector:
    app: order-processor
  ports:
    - port: 8080

---
# Deployment — wave 3, after migration and Service exist
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-processor
  namespace: orders
  annotations:
    argocd.argoproj.io/sync-wave: "3"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: order-processor
  template:
    metadata:
      labels:
        app: order-processor
    spec:
      containers:
        - name: scoring
          image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/order-processor:latest

---
# HPA — wave 4, after Deployment exists
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-processor-hpa
  namespace: orders
  annotations:
    argocd.argoproj.io/sync-wave: "4"
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind:       Deployment
    name:       order-processor
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type:               Utilization
          averageUtilization: 65

---
# Ingress — wave 4, after Deployment is healthy
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: order-processor-ingress
  namespace: orders
  annotations:
    argocd.argoproj.io/sync-wave: "4"
spec:
  # spec.ingressClassName replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  rules:
    - host: orders-api.internal.company.com
      http:
        paths:
          - path:     /api/v1/score
            pathType: Prefix
            backend:
              service:
                name: order-processor-service
                port:
                  number: 8080

---
# Post-sync smoke test — wave 5, validates deployment succeeded
apiVersion: batch/v1
kind: Job
metadata:
  name: order-processor-smoke-test
  namespace: orders
  annotations:
    argocd.argoproj.io/sync-wave:        "5"
    argocd.argoproj.io/hook:             PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 2
  activeDeadlineSeconds: 60
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: smoke-test
          image: curlimages/curl:latest
          command:
            - /bin/sh
            - -c
            - |
              response=$(curl -s -o /dev/null -w "%{http_code}" \
                http://order-processor-service.orders.svc.cluster.local:8080/health)
              if [ "$response" != "200" ]; then
                echo "Smoke test failed — health check returned $response"
                exit 1
              fi
              echo "Smoke test passed — service is healthy"

Sync Hooks — Running Jobs at Specific Points

Hook	When It Runs	Use Case
PreSync	Before any resources are applied	Drain connections, notify downstream
Sync	During the sync (respects wave order)	Database migrations
PostSync	After all resources are healthy	Smoke tests, notifications, cache warm
SyncFail	If the sync fails	Alert, rollback trigger, cleanup

# SyncFail hook — alert when a production deployment fails
apiVersion: batch/v1
kind: Job
metadata:
  name: deployment-failure-alert
  annotations:
    argocd.argoproj.io/hook:             SyncFail
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: alert
          image: curlimages/curl:latest
          command:
            - /bin/sh
            - -c
            - |
              curl -X POST $SLACK_WEBHOOK_URL \
                -H "Content-Type: application/json" \
                -d '{"text": "🔴 DEPLOYMENT FAILED: orders-prod sync failed. Check ArgoCD immediately."}'
          env:
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: alerting-secrets
                  key:  slack-webhook

6. Multi-Cluster Strategy — Managing Dev and Prod from One ArgoCD

ArgoCD can manage multiple clusters from a single installation. This is the standard pattern for an organisation with a dedicated prod cluster and a shared dev/staging cluster.

Registering Clusters with ArgoCD

# Register the prod cluster with ArgoCD
argocd cluster add prod-eks-cluster \
  --name prod \
  --kubeconfig ~/.kube/prod-config \
  --in-cluster=false

# Verify registration
argocd cluster list

SERVER                              NAME   VERSION  STATUS
https://prod-cluster.example.com    prod   1.30     Successful
https://kubernetes.default.svc      dev    1.30     Successful

Terraform — Automating Cluster Registration

resource "kubernetes_secret" "argocd_prod_cluster" {
  metadata {
    name      = "prod-cluster"
    namespace = "argocd"
    labels = {
      "argocd.argoproj.io/secret-type" = "cluster"
    }
  }

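  data = {
    name   = "prod"
    server = aws_eks_cluster.prod.endpoint
    config = jsonencode({
      # A token from aws_eks_cluster_auth expires after about 15 minutes, so a
      # static bearerToken stops working shortly after apply. awsAuthConfig lets
      # ArgoCD request fresh tokens itself. var.argocd_prod_role_arn is an assumed
      # variable: an IAM role ArgoCD can assume that has access to the prod cluster.
      awsAuthConfig = {
        clusterName = aws_eks_cluster.prod.name
        roleARN     = var.argocd_prod_role_arn
      }
      tlsClientConfig = {
        insecure = false
        caData   = aws_eks_cluster.prod.certificate_authority[0].data
      }
    })
  }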

  type = "Opaque"
}

Network Connectivity Between Clusters

ArgoCD (running in the dev/platform cluster) must be able to reach the prod cluster's API server. Since the prod cluster has endpoint_public_access = false, traffic must route privately:

ArgoCD pod (dev cluster, eu-west-1)
    → Pod IP → Dev VPC
        → Transit Gateway
            → Prod VPC
                → Prod EKS API server (private endpoint: 10.x.x.x)

# TGW route: traffic for the prod VPC CIDR must go to the PROD VPC attachment
# (the attachment resource name "prod" is assumed, mirroring the dev one)
resource "aws_ec2_transit_gateway_route" "dev_to_prod_api" {
  destination_cidr_block         = "10.1.0.0/16"   # Prod VPC CIDR
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.prod.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.platform.id
}

# Security group rule — prod cluster API server accepts from dev cluster ArgoCD pods
resource "aws_security_group_rule" "argocd_to_prod_api" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["10.0.0.0/16"]   # Dev VPC CIDR — ArgoCD pod IPs
  security_group_id = aws_security_group.prod_eks_cluster.id
  description       = "ArgoCD from dev cluster to prod API server"
}
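Both CIDR conditions have to hold for the connection to work: the destination must be covered by the TGW route, and the source must be covered by the security group rule. A small Python check with the standard ipaddress module makes that explicit. The IP addresses are invented examples and path_allowed is a hypothetical helper; it does not model route tables, NACLs or return traffic.

```python
from ipaddress import ip_address, ip_network


def path_allowed(src_ip: str, dst_ip: str, route_cidr: str, ingress_cidr: str) -> bool:
    """True if dst is covered by the route and src is covered by the ingress rule."""
    routed = ip_address(dst_ip) in ip_network(route_cidr)
    admitted = ip_address(src_ip) in ip_network(ingress_cidr)
    return routed and admitted


# ArgoCD pod in the dev VPC calling the prod API server's private endpoint
ok = path_allowed("10.0.12.34", "10.1.5.10", "10.1.0.0/16", "10.0.0.0/16")

# A pod from some other VPC is routed but rejected by the security group
rejected = path_allowed("10.2.0.5", "10.1.5.10", "10.1.0.0/16", "10.0.0.0/16")

print(ok, rejected)
```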

7. ArgoCD Notifications — Know When Deployments Fail

ArgoCD without notifications is a black box. Engineers must manually check the UI to know if a deployment succeeded. In production, a failed deployment must trigger an immediate alert.

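# NOTE: since ArgoCD 2.3 the notifications controller ships inside the main argo-cd
# chart (values key "notifications"). This standalone chart is deprecated and is
# only needed on older installations.
resource "helm_release" "argocd_notifications" {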
  name       = "argocd-notifications"
  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argocd-notifications"
  version    = "1.8.1"
  namespace  = "argocd"

  values = [yamlencode({
    tolerations = [{
      key = "node-role", value = "system",
      operator = "Equal", effect = "NoSchedule"
    }]
  })]
}

# argocd-notifications-cm ConfigMap — triggers and templates
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  # Notification triggers — when to notify
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [app-sync-failed]

  trigger.on-sync-succeeded: |
    - when: app.status.operationState.phase in ['Succeeded']
      send: [app-sync-succeeded]

  # Reuses the app-sync-succeeded template: no app-deployed template is defined below
  trigger.on-deployed: |
    - when: app.status.operationState.phase in ['Succeeded'] and app.status.health.status == 'Healthy'
      send: [app-sync-succeeded]

  trigger.on-health-degraded: |
    - when: app.status.health.status == 'Degraded'
      send: [app-health-degraded]

  # Notification templates — what to send
  template.app-sync-failed: |
    slack:
      attachments: |
        [{
          "title": "🔴 Deployment Failed: {{.app.metadata.name}}",
          "color": "#E53E3E",
          "fields": [
            {"title": "Environment", "value": "{{.app.metadata.labels.environment}}", "short": true},
            {"title": "Namespace", "value": "{{.app.spec.destination.namespace}}", "short": true},
            {"title": "Error", "value": "{{.app.status.operationState.message}}", "short": false}
          ],
          "actions": [{
            "type": "button",
            "text": "View in ArgoCD",
            "url": "{{.context.argocdUrl}}/applications/{{.app.metadata.name}}"
          }]
        }]

  template.app-sync-succeeded: |
    slack:
      attachments: |
        [{
          "title": "✅ Deployment Succeeded: {{.app.metadata.name}}",
          "color": "#38A169",
          "fields": [
            {"title": "Environment", "value": "{{.app.metadata.labels.environment}}", "short": true},
            {"title": "Revision", "value": "{{.app.status.sync.revision}}", "short": true}
          ]
        }]

  template.app-health-degraded: |
    slack:
      attachments: |
        [{
          "title": "⚠️ Application Degraded: {{.app.metadata.name}}",
          "color": "#D69E2E",
          "fields": [
            {"title": "Health Status", "value": "{{.app.status.health.status}}", "short": true},
            {"title": "Namespace", "value": "{{.app.spec.destination.namespace}}", "short": true}
          ]
        }]

  # Notification services — where to send
  service.slack: |
    token: $slack-token

---
# Subscribe Applications to notifications via annotations
# Add to each Application or ApplicationSet template
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-failed.slack:     "platform-alerts"
    notifications.argoproj.io/subscribe.on-health-degraded.slack:  "platform-alerts"
    notifications.argoproj.io/subscribe.on-deployed.slack:         "deployments"
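The trigger conditions above are plain boolean expressions over the Application status. Re-stating them in Python makes it easy to see which triggers fire for a given state. This mirrors only the four conditions in the ConfigMap; fired_triggers is a hypothetical helper, not how the notifications controller evaluates expressions.

```python
def fired_triggers(app: dict) -> list:
    """Return the names of the triggers whose `when` condition matches this app."""
    status = app.get("status", {})
    phase = status.get("operationState", {}).get("phase")
    health = status.get("health", {}).get("status")

    fired = []
    if phase in ("Error", "Failed"):
        fired.append("on-sync-failed")
    if phase == "Succeeded":
        fired.append("on-sync-succeeded")
    if phase == "Succeeded" and health == "Healthy":
        fired.append("on-deployed")
    if health == "Degraded":
        fired.append("on-health-degraded")
    return fired


healthy_app = {"status": {"operationState": {"phase": "Succeeded"},
                          "health": {"status": "Healthy"}}}
failed_app = {"status": {"operationState": {"phase": "Failed"},
                         "health": {"status": "Degraded"}}}

print(fired_triggers(healthy_app))
print(fired_triggers(failed_app))
```

Note that a sync can succeed while the app is still Progressing, in which case on-sync-succeeded fires but on-deployed does not.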

8. Image Updater — Automating Image Tag Updates

Without Image Updater, updating a service's image tag requires a Git commit to change image.tag in values.yaml — then waiting for ArgoCD to sync. With Image Updater, ArgoCD watches ECR for new image tags and automatically commits the update to Git.

# argocd-image-updater ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-image-updater-config
  namespace: argocd
data:
  # ECR registry — authenticate via Pod Identity
  registries.conf: |
    registries:
      - name: ECR
        prefix: 123456789012.dkr.ecr.eu-west-1.amazonaws.com
        api_url: https://123456789012.dkr.ecr.eu-west-1.amazonaws.com
        credentials: ext:/scripts/ecr-login.sh
        default: true

---
# Annotate Application to enable Image Updater
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-dev
  annotations:
    # Watch this image for new tags
    argocd-image-updater.argoproj.io/image-list: |
      scoring=123456789012.dkr.ecr.eu-west-1.amazonaws.com/order-processor

    # Update strategy — semver, latest, digest
    argocd-image-updater.argoproj.io/scoring.update-strategy: semver
    # Single quotes: a YAML double-quoted string rejects the \. escape
    argocd-image-updater.argoproj.io/scoring.allow-tags: 'regexp:^v[0-9]+\.[0-9]+\.[0-9]+$'

    # Write back to Git (not just in-memory)
    argocd-image-updater.argoproj.io/write-back-method: git
    argocd-image-updater.argoproj.io/git-branch: main

    # Which Helm value to update
    argocd-image-updater.argoproj.io/scoring.helm.image-name: image.repository
    argocd-image-updater.argoproj.io/scoring.helm.image-tag:  image.tag

Use Image Updater for dev only. In production, image tag updates should go through a PR review. Never enable Image Updater with autoSync: true in production — that is a direct pipeline from ECR to production with no human review.

9. The Production GitOps Workflow — End to End

Developer pushes code to feature branch
    ↓
CI pipeline runs (GitHub Actions):
  - Build Docker image
  - Run unit + integration tests
  - Push to ECR with semver tag (e.g., v1.4.2)
  - Run Terraform plan (infrastructure changes)
  - Run tfsec / Checkov security scan
  - Post results as PR comment
    ↓
PR opened → review by peer + platform team
    ↓
PR approved + merged to main
    ↓
CI pipeline runs on main:
  - Build and push release image to ECR
  - Update image tag in values-dev.yaml → commit to gitops-config repo
    ↓
ArgoCD detects change in gitops-config (within ~3 minutes)
    ↓
ArgoCD syncs orders-dev (autoSync: true)
  - Sync Wave 1: database migration Job runs and completes
  - Sync Wave 3: Deployment rolls out new image
  - Sync Wave 5: smoke test Job validates health
  - Notification → #deployments: "✅ orders-dev deployed v1.4.2"
    ↓
Staging sync (autoSync: true, same cluster as dev)
    ↓
QA validation on staging
    ↓
Platform team triggers MANUAL sync in ArgoCD UI for prod:
  - Reviews diff — exactly which resources change
  - Checks sync wave order — migration before deployment
  - Clicks Sync
    ↓
ArgoCD syncs orders-prod (manual)
  - Same wave sequence, same smoke test
  - Notification → #platform-alerts: "✅ orders-prod deployed v1.4.2"
    ↓
DORA audit trail:
  - Git commit: who changed what, when, PR review
  - ArgoCD sync history: what deployed, to which cluster, at what time
  - ECR image: immutable tag, digest verifiable

10. Common Mistakes & Anti-Patterns

Mistake 1: `autoSync: true` in Production With `selfHeal: true`

The most dangerous combination. A bad merge to main plus autoSync means an immediate production rollout, and selfHeal means any manual emergency fix via kubectl is reverted within seconds. In production, autoSync: false is non-negotiable. Note that selfHeal is a field of syncPolicy.automated, so it only takes effect when automated sync is enabled: with manual sync, ArgoCD reports a manual change as OutOfSync but does not revert it until someone triggers a sync.
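
In Application terms, manual sync means leaving the syncPolicy.automated block out entirely. The sketch below shows one way a production Application could look. The repo URL, path, project name, and cluster name are hypothetical placeholders, not values from this platform:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-prod
  namespace: argocd
spec:
  project: orders-prod                    # hypothetical project name
  source:
    repoURL: https://github.com/example-org/gitops-config.git   # hypothetical
    targetRevision: main
    path: apps/orders                     # hypothetical path
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    name: prod-cluster                    # hypothetical registered cluster name
    namespace: orders
  syncPolicy:
    # No `automated` block: ArgoCD only marks the app OutOfSync and waits
    # for a manual sync. prune and selfHeal live under `automated`, so
    # neither applies here.
    syncOptions:
      - CreateNamespace=true
```

The dev equivalent would add an automated block with prune and selfHeal, which is what makes the two environments behave differently from the same Git commit.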

Mistake 2: No `ignoreDifferences` for HPA-Managed Replicas

With automated sync and selfHeal enabled, ArgoCD resets /spec/replicas to the value rendered from your Helm chart, while HPA adjusts replicas based on load. The result is a constant fight: HPA scales to 8, ArgoCD reverts to 2, HPA scales back to 8. With manual sync there is no fight, but the Application shows as permanently OutOfSync. Add ignoreDifferences for /spec/replicas on every Deployment managed by HPA.
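
A minimal sketch of the ignoreDifferences entry follows. It is a fragment of an Application spec (source, destination, and project are omitted), and the Application name is only illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-prod          # illustrative name
  namespace: argocd
spec:
  # Fragment: project, source and destination omitted for brevity
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas     # leave the replica count to HPA
  syncPolicy:
    syncOptions:
      # Also skip the ignored field during sync, not only in the diff
      - RespectIgnoreDifferences=true
```

Without RespectIgnoreDifferences, the field is hidden from the diff but can still be overwritten the next time a sync applies the full manifest.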

Mistake 3: Storing Secrets in the GitOps Repo

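Values files in Git must never contain secrets. Commit only an ExternalSecret that references the secret. External Secrets Operator then pulls the value from AWS Secrets Manager inside the cluster and refreshes it every refreshInterval: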

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: orders-db-credentials
  namespace: orders
spec:
  refreshInterval: 1h
  secretStoreRef:
    name:  aws-secrets-manager
    kind:  ClusterSecretStore
  target:
    name:           orders-db-credentials
    creationPolicy: Owner
  data:
    - secretKey: url
      remoteRef:
        key:      prod/orders/db-credentials
        property: connection_url

Mistake 4: Sync Waves Without Health Checks

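ArgoCD starts a wave only after every resource in the previous wave reports healthy. A Deployment in wave 3 is healthy when its rollout has finished and its pods are Running and Ready. A Job in wave 1 is healthy when it reaches the Complete condition. A migration Job that hangs indefinitely therefore blocks every later wave and the entire sync. Always set activeDeadlineSeconds so the Job fails instead of stalling.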
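
As a sketch of what that looks like, the migration Job below runs as a Sync hook in wave 1 and is killed after ten minutes. The Job name, the migrate command, and the deadline value are assumptions for illustration. The image path reuses the example registry from the Image Updater section:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: orders-db-migrate                 # hypothetical name
  namespace: orders
  annotations:
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/sync-wave: "1"     # runs before the wave 3 Deployment
    # Delete the previous run before creating a new one (Jobs are immutable)
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  activeDeadlineSeconds: 600              # fail the Job after 10 minutes
  backoffLimit: 2                         # retry at most twice
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/order-processor:v1.4.2
          command: ["/app/migrate", "up"] # hypothetical migration entrypoint
```

When the deadline passes, Kubernetes marks the Job as Failed, ArgoCD fails the sync, and the on-sync-failed notification fires instead of the sync hanging silently.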

Mistake 5: No App of Apps — Managing Applications Manually

Manually applying Application YAML files means those Application objects themselves are not tracked in Git. An accidentally deleted Application can take down the service it manages (a cascading delete removes the resources along with the Application), with no record in Git of what it looked like. The App of Apps pattern ensures Application objects are themselves managed as code.

Mistake 6: One ArgoCD Project for Everything

A single default Project means every team can see, sync, and delete every other team's Applications. Projects are free to create and provide the namespace and cluster scoping that prevents accidental cross-team interference. Create one Project per team per environment.
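
A per-team, per-environment Project could be sketched like this. The repo URL, cluster name, role name, and SSO group are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: orders-prod
  namespace: argocd
spec:
  description: Orders team, production
  # Only this repo may be used as a source
  sourceRepos:
    - https://github.com/example-org/gitops-config.git   # hypothetical
  # Only this cluster and namespace may be targeted
  destinations:
    - name: prod-cluster                                  # hypothetical
      namespace: orders
  # No clusterResourceWhitelist: cluster-scoped resources stay denied
  roles:
    - name: deployer                                      # hypothetical role
      description: Can view and sync Applications in this project only
      policies:
        - p, proj:orders-prod:deployer, applications, get, orders-prod/*, allow
        - p, proj:orders-prod:deployer, applications, sync, orders-prod/*, allow
      groups:
        - orders-team                                     # hypothetical SSO group
```

An Application that sets project: orders-prod and points at any other repo, cluster, or namespace is rejected by ArgoCD before it can sync.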

Mistake 7: Not Pinning ArgoCD Version

ArgoCD upgrades can change behaviour — ApplicationSet generators, sync policies, health check logic. Pin the Helm chart version and the ArgoCD image tag in your Terraform/Helm values. Test upgrades in dev before upgrading prod.
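
One way to express the pin in YAML is shown below, assuming ArgoCD manages its own Helm release through an Application. If the chart is installed from Terraform instead, the same two pins go into the helm_release version argument and its values. The version strings are placeholders to fill in, and the project name is hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd
  namespace: argocd
spec:
  project: platform                        # hypothetical project name
  source:
    repoURL: https://argoproj.github.io/argo-helm
    chart: argo-cd
    targetRevision: "<pinned-chart-version>"   # exact version, never a range
    helm:
      valuesObject:
        global:
          image:
            tag: "<pinned-argocd-version>"     # exact ArgoCD image tag
  destination:
    name: in-cluster
    namespace: argocd
  # No automated sync: upgrades are applied deliberately, dev first
```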

Architecture Decision Matrix

Pattern	Use Case	Complexity	When to Use
Single Application	One service, one environment	Low	Getting started, simple setups
App of Apps	Bootstrap all Applications from Git	Medium	Any production cluster
ApplicationSet (List)	Fixed set of envs per service	Low-Medium	Standard multi-env deployment
ApplicationSet (Git)	Auto-discover services from repo	Medium	Large platforms, frequent new services
ApplicationSet (Matrix)	N services × M environments	Medium-High	Enterprise platforms
Sync Waves	Ordered deployment — migrations first	Medium	Any service with DB dependencies
Sync Hooks	Pre/Post actions — smoke tests	Medium	Production deployments
Multi-cluster	Prod + dev from one ArgoCD	Medium	Two+ clusters in your organisation
Image Updater	Auto-update image tags in Git	Medium	Dev/staging continuous delivery

The Golden Rule

“GitOps is not about the tool — it is about the discipline. ArgoCD is the enforcement mechanism, but the value comes from the commitment: Git is the only source of truth, every change is a pull request, every deployment has an author and a timestamp, and production syncs are always deliberate. App of Apps from Day 1 so your Application objects are themselves managed as code. Sync Waves for every service with dependencies. autoSync: false in production, always. And notifications on every sync outcome — because a deployment you don't know about is an incident you're not prepared for.”

Tags: #ArgoCD #GitOps #EKS #Kubernetes #Helm #DevSecOps #IaC #DORA #AppOfApps #ApplicationSet #SyncWaves #MultiCluster

Ankush Panday

Specializing in highly scalable AWS infrastructure and automated quality engineering.
