AWS Observability Stack: CloudWatch, X-Ray, OpenTelemetry & What's Still Missing

May 24, 2026 • 20 min read Observability SRE

AWS Series | Part 16 — Building secure, cost-optimised, cloud-native infrastructure on AWS.

AWS Observability Stack: CloudWatch, X-Ray, OpenTelemetry

TL;DR

Tool	What It Does	Layer	When You Need It
CloudWatch Metrics	Time-series metrics from AWS services + custom	Infrastructure	Always — Day 1
CloudWatch Logs	Log ingestion, query, alerting	Application	Always — Day 1
CloudWatch Container Insights	Pod/node/cluster metrics for EKS/ECS	Container	EKS/ECS workloads
CloudWatch Alarms	Threshold-based alerting on metrics	Infrastructure	Always
AWS X-Ray	Distributed tracing across services	Application	Microservices, latency debugging
AWS Distro for OpenTelemetry (ADOT)	Vendor-neutral instrumentation + collection	Application	When you want portability
Prometheus + Grafana	Kubernetes-native metrics + dashboards	Container	EKS at scale
CloudWatch Synthetics	Canary tests — simulate user traffic	Endpoint	Public-facing APIs
CloudWatch RUM	Real User Monitoring — actual browser sessions	Frontend	Web applications

Introduction — The Incident That Made This Blog Necessary

A newly onboarded microservice entered a crash loop. The pod was failing. No alarm fired. No alert reached the team. The service sat degraded in silence for longer than it should have.

The root cause was simple in hindsight: the CloudWatch log group for that service had not been added to the alarm configuration. The observability stack was monitoring the log groups that existed when it was built — this service was added after. The monitoring had not been extended to cover it.

This is the observability failure mode that nobody talks about: not the absence of tooling, but the gap between what the tools cover and what's actually running. You can have CloudWatch, X-Ray, and OpenTelemetry all deployed and still miss a degraded service because the alarm was never wired up to the new log group.

This post covers the complete AWS observability stack — metrics, logs, traces, alarms, and synthetic monitoring — and more importantly, how to build it so that gaps like this are structurally impossible. Every new service that is deployed automatically gets observability coverage. No manual wiring. No forgotten log groups.

The immediate fix: move from log-based alarms to Container Insights metric alarms on pod_number_of_container_restarts. A metric that exists for every pod in the cluster automatically, without any log group configuration.

The complete stack covered in this post — CloudWatch Container Insights, structured logging with Fluent Bit, distributed tracing with ADOT and X-Ray, CloudWatch Synthetics, and SLO tracking — is the production-grade observability model for any EKS platform. Not all of it needs to be in place on Day 1. The maturity model in Section 9 gives you the right adoption order.

1. The Three Pillars of Observability

Before any tool selection, understand the three distinct signals you need:

Metrics  — What is happening? (CPU 87%, error rate 2.3%, latency p99 340ms)
Logs     — Why is it happening? (stack trace, config error, null pointer)
Traces   — Where is it happening? (which service, which downstream call, which database query)

Each pillar answers a different question. An alarm fires (metric). You open the logs to find the error (logs). You trace the request path to find the bottleneck (traces). You need all three — and they need to be correlated so you can move between them without losing context.

CloudWatch Alarm fires: pod_number_of_container_restarts > 3
    ↓
Open Container Insights: which pod, which node, which namespace
    ↓
Open CloudWatch Logs: what error is the pod throwing
    ↓
Open X-Ray trace: which downstream call is causing the error
    ↓
Root cause: RDS connection pool exhausted — traced to a specific query

Without this chain, you debug by guessing. With it, you debug by following the evidence.

2. CloudWatch Container Insights — The Observability Foundation for EKS

Why Container Insights, Not Log-Based Alarms

The incident described in the introduction happened because alarms were watching CloudWatch log groups — which must be explicitly configured per service. Container Insights metrics are different: they are emitted automatically for every pod, every node, and every cluster once the CloudWatch agent DaemonSet is deployed. No per-service configuration. No forgotten log groups.

Log-based alarm:
  CloudWatch Logs → Metric Filter → Alarm
  Problem: new service = new log group = new metric filter = new alarm = manual step

Container Insights metric alarm:
  CloudWatch Agent DaemonSet → Container Insights → Alarm on pod_container_restarts
  New service = new pod = automatically covered by existing alarm

Deploying the CloudWatch Agent via Helm

# Terraform — CloudWatch agent via Helm
resource "helm_release" "cloudwatch_agent" {
  name             = "amazon-cloudwatch-observability"
  repository       = "https://aws.github.io/eks-charts"
  chart            = "amazon-cloudwatch-observability"
  version          = "1.5.0"
  namespace        = "amazon-cloudwatch"
  create_namespace = true

  set {
    name  = "clusterName"
    value = aws_eks_cluster.main.name
  }

  set {
    name  = "region"
    value = var.region
  }

    # Run on every node — DaemonSet, not Deployment
  # No node selector needed — DaemonSet handles this automatically

  set {
    name  = "containerInsights.enabled"
    value = "true"
  }

  set {
    name  = "fluentBit.enabled"
    value = "true"
  }

  depends_on = [aws_eks_addon.vpc_cni]
}

Key Container Insights Metrics

Once deployed, these metrics are available in CloudWatch under the ContainerInsights namespace — automatically, for every pod:

Metric	What It Measures	Alarm Threshold
`pod_number_of_container_restarts`	Container restart count	> 3 in 5 minutes
`pod_cpu_utilization`	Pod CPU % of request	> 80% for 10 min
`pod_memory_utilization`	Pod memory % of limit	> 85% for 10 min
`pod_network_rx_bytes`	Inbound network bytes	Baseline + 3σ
`node_cpu_utilization`	Node CPU %	> 80% for 15 min
`node_memory_utilization`	Node memory %	> 85% for 15 min
`node_disk_iops`	Node disk I/O	Baseline deviation
`cluster_node_count`	Total nodes in cluster	< minimum threshold
`cluster_failed_node_count`	Failed nodes	> 0

The Alarm That Prevents Silent Pod Failures

This is the alarm that would have caught the incident. It monitors container restarts across the entire cluster — every namespace, every pod, every service — with no per-service configuration:

# Alarm on pod restarts — covers ALL pods automatically
# New services are covered immediately when they deploy
resource "aws_cloudwatch_metric_alarm" "pod_restarts" {
  alarm_name          = "eks-pod-container-restarts-high"
  alarm_description   = "Pod restarting repeatedly — likely crash loop. Check CloudWatch Container Insights."
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "pod_number_of_container_restarts"
  namespace           = "ContainerInsights"
  period              = 300   # 5 minutes
  statistic           = "Sum"
  threshold           = 5     # More than 5 restarts in 5 minutes

  dimensions = {
    ClusterName = aws_eks_cluster.main.name
        # No Namespace or PodName dimension — covers ALL pods cluster-wide
  }

  alarm_actions = [
    aws_sns_topic.platform_alerts.arn,
    aws_sns_topic.pagerduty.arn        # Page on-call immediately
  ]
  ok_actions = [aws_sns_topic.platform_alerts.arn]

  treat_missing_data = "notBreaching"
  tags               = { Name = "pod-restart-alarm" }
}

# Per-namespace alarm — more granular, for teams that own their namespace
resource "aws_cloudwatch_metric_alarm" "pod_restarts_namespace" {
  for_each = toset(["fraud-detection", "payments", "identity", "data-pipeline"])

  alarm_name          = "eks-pod-restarts-${each.key}"
  alarm_description   = "Pod crash loop in ${each.key} namespace"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "pod_number_of_container_restarts"
  namespace           = "ContainerInsights"
  period              = 300
  statistic           = "Sum"
  threshold           = 3

  dimensions = {
    ClusterName = aws_eks_cluster.main.name
    Namespace   = each.key   # Scoped to namespace — team-level alert
  }

  alarm_actions = [aws_sns_topic.team_alerts[each.key].arn]
  tags          = { Name = "pod-restarts-${each.key}" }
}

Why this matters in production: The cluster-wide alarm fires regardless of which namespace or service is restarting. A new service deployed yesterday — with no individual alarm configured — is automatically covered. This is the architectural fix for the incident described in the introduction: move from per-log-group alarms to cluster-level metric alarms.

3. CloudWatch Alarms — The Complete Production Set

Alarm Architecture — SNS → Multi-Channel

# SNS topic — routes alerts to multiple destinations simultaneously
resource "aws_sns_topic" "platform_alerts" {
  name              = "platform-alerts"
  kms_master_key_id = aws_kms_key.sns.arn
}

# Slack notification via Lambda
resource "aws_sns_topic_subscription" "slack" {
  topic_arn = aws_sns_topic.platform_alerts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.slack_notifier.arn
}

# PagerDuty via HTTPS endpoint (critical alerts only)
resource "aws_sns_topic_subscription" "pagerduty" {
  topic_arn = aws_sns_topic.platform_alerts.arn
  protocol  = "https"
  endpoint  = var.pagerduty_integration_url
}

# Email — for audit trail
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.platform_alerts.arn
  protocol  = "email"
  endpoint  = var.platform_team_email
}

The Minimum Production Alarm Set

locals {
    # Every alarm in this map is created automatically
  # Adding a new alarm = adding an entry here + terraform apply
  cluster_alarms = {
    pod_restarts = {
      metric     = "pod_number_of_container_restarts"
      threshold  = 5
      period     = 300
      stat       = "Sum"
      comparison = "GreaterThanThreshold"
      desc       = "Pod crash loop detected cluster-wide"
    }
    node_cpu = {
      metric     = "node_cpu_utilization"
      threshold  = 80
      period     = 900
      stat       = "Average"
      comparison = "GreaterThanThreshold"
      desc       = "Node CPU above 80% for 15 minutes"
    }
    node_memory = {
      metric     = "node_memory_utilization"
      threshold  = 85
      period     = 900
      stat       = "Average"
      comparison = "GreaterThanThreshold"
      desc       = "Node memory above 85% for 15 minutes"
    }
    cluster_failed_nodes = {
      metric     = "cluster_failed_node_count"
      threshold  = 0
      period     = 60
      stat       = "Maximum"
      comparison = "GreaterThanThreshold"
      desc       = "One or more cluster nodes in failed state"
    }
    node_count_low = {
      metric     = "cluster_node_count"
      threshold  = 2
      period     = 300
      stat       = "Minimum"
      comparison = "LessThanThreshold"
      desc       = "Cluster node count below minimum — Karpenter may not be provisioning"
    }
  }
}

resource "aws_cloudwatch_metric_alarm" "cluster" {
  for_each = local.cluster_alarms

  alarm_name          = "eks-${each.key}"
  alarm_description   = each.value.desc
  comparison_operator = each.value.comparison
  evaluation_periods  = 2
  metric_name         = each.value.metric
  namespace           = "ContainerInsights"
  period              = each.value.period
  statistic           = each.value.stat
  threshold           = each.value.threshold

  dimensions = {
    ClusterName = aws_eks_cluster.main.name
  }

  alarm_actions      = [aws_sns_topic.platform_alerts.arn]
  treat_missing_data = "notBreaching"

  tags = { Name = "cluster-alarm-${each.key}" }
}

ALB Alarms — Application Layer Health

resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "alb-5xx-errors-high"
  alarm_description   = "ALB 5xx error rate above 1% — application errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  threshold           = 1

  metric_query {
    id          = "error_rate"
    expression  = "errors / requests * 100"
    label       = "5xx Error Rate %"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "HTTPCode_Target_5XX_Count"
      period      = 60
      stat        = "Sum"
      dimensions = { LoadBalancer = aws_lb.main.arn_suffix }
    }
  }

  metric_query {
    id = "requests"
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "RequestCount"
      period      = 60
      stat        = "Sum"
      dimensions = { LoadBalancer = aws_lb.main.arn_suffix }
    }
  }

  alarm_actions = [aws_sns_topic.platform_alerts.arn]
}

resource "aws_cloudwatch_metric_alarm" "alb_latency_p99" {
  alarm_name          = "alb-p99-latency-high"
  alarm_description   = "ALB P99 latency above 500ms — performance degradation"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  extended_statistic  = "p99"
  threshold           = 0.5   # 500ms

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }

  alarm_actions = [aws_sns_topic.platform_alerts.arn]
}

CI Check — Enforce Alarm Coverage for New Services

This is the CI gate that prevents the "new service, no alarm" gap structurally:

# .github/workflows/observability-gate.yaml
name: Observability Coverage Check

on:
  pull_request:
    paths:
      - 'gitops-config/apps/services/**'

jobs:
  check-alarm-coverage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Detect new services
        id: new_services
        run: |
          new_dirs=$(git diff --name-only origin/main HEAD | \
            grep "gitops-config/apps/services/" | \
            awk -F/ '{print $3}' | sort -u)
          echo "new_services=$new_dirs" >> $GITHUB_OUTPUT

      - name: Check alarm coverage
        run: |
          for service in ${{ steps.new_services.outputs.new_services }}; do
            alarm_count=$(grep -r "$service" terraform/monitoring/ | \
              grep "aws_cloudwatch_metric_alarm" | wc -l)
            if [ "$alarm_count" -eq 0 ]; then
              echo "❌ ERROR: No CloudWatch alarm found for service: $service"
              echo "Add alarm coverage in terraform/monitoring/ before merging."
              exit 1
            fi
            echo "✅ Alarm coverage confirmed for: $service"
          done

Why this matters in production: The CI check makes alarm coverage a deployment gate, not a post-deployment task. A PR that adds a new service to the ArgoCD ApplicationSet without a corresponding alarm definition in Terraform fails the pipeline. The gap that caused the incident is now structurally impossible.

4. CloudWatch Logs — Structured Logging at Scale

Fluent Bit — Log Routing in EKS

The CloudWatch agent Helm chart deploys Fluent Bit as a DaemonSet. It collects logs from every container on every node and routes them to CloudWatch Logs — one log group per namespace, one log stream per pod.

# Fluent Bit configuration — routes logs to CloudWatch with structured metadata
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On
        Refresh_Interval  10

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [OUTPUT]
        Name              cloudwatch_logs
        Match             kube.*
        region            eu-west-1
        log_group_name    /eks/$(kubernetes['namespace_name'])
        log_stream_name   $(kubernetes['pod_name'])/$(kubernetes['container_name'])
        auto_create_group true
        retry_limit       2

auto_create_group true — this single setting means new services get their log group created automatically on first log output. No manual log group creation, no missing log groups.

CloudWatch Log Insights — Useful Queries

-- Error rate by service over the last hour
fields @timestamp, @message, kubernetes.namespace_name, kubernetes.pod_name
| filter @message like /ERROR|Exception|FATAL/
| stats count(*) as error_count by kubernetes.namespace_name
| sort error_count desc

-- Slow requests — find responses over 500ms
fields @timestamp, method, path, duration_ms, status_code
| filter duration_ms > 500
| stats avg(duration_ms) as avg_ms, max(duration_ms) as max_ms, count() as count
  by path
| sort avg_ms desc
| limit 20

-- Pod restart events in last 30 minutes
fields @timestamp, @message, kubernetes.pod_name, kubernetes.namespace_name
| filter @message like /Back-off restarting failed container/
| sort @timestamp desc
| limit 50

-- Memory OOM kills
fields @timestamp, @message, kubernetes.pod_name
| filter @message like /OOMKilled|out of memory/
| sort @timestamp desc

Log Retention — Environment-Driven

# Log retention — defined once, applied consistently
locals {
  log_retention = {
    prod    = 90
    staging = 30
    dev     = 7
  }
}

resource "aws_cloudwatch_log_group" "eks_namespace" {
  for_each = toset(var.namespaces)

  name              = "/eks/${each.key}"
  retention_in_days = local.log_retention[var.environment]
  kms_key_id        = aws_kms_key.logs.arn

  tags = { Name = "eks-logs-${each.key}", Environment = var.environment }
}

5. AWS X-Ray — Distributed Tracing

What X-Ray Solves

In a microservices architecture, a single user request touches multiple services before returning a response. CloudWatch metrics tell you something is slow. CloudWatch logs tell you what happened inside one service. Neither tells you which service in the chain is the problem.

That is what distributed tracing solves. When latency spikes or an error occurs, you need to answer:

Which service in the request path is slow?
Which downstream call is blocking?
Which external dependency — database, queue, third-party API — is the bottleneck?

X-Ray traces the complete request path across all services — with timing at each hop — so you can follow the evidence rather than guess.

User request → API Gateway → Scoring Service → Enrichment Service → S3 / RDS
                             [230ms]           [180ms]              [45ms]
                                                ↑ This is your bottleneck

X-Ray with the AWS Distro for OpenTelemetry (ADOT)

The modern approach is not X-Ray SDK instrumentation — it is OpenTelemetry instrumentation with ADOT as the collector. Your application emits standard OpenTelemetry traces. ADOT receives them and forwards to X-Ray. This gives you vendor portability — switch to Jaeger or Datadog later without re-instrumenting your applications.

# ADOT Collector via Helm — receives OTLP traces, forwards to X-Ray
resource "helm_release" "adot_collector" {
  name             = "adot-collector"
  repository       = "https://aws.github.io/eks-charts"
  chart            = "aws-otel-collector"
  version          = "0.33.0"
  namespace        = "amazon-cloudwatch"
  create_namespace = true

  values = [yamlencode({
    mode = "deployment"   # Centralised collector — not DaemonSet

    config = {
      receivers = {
        otlp = {
          protocols = {
            grpc = { endpoint = "0.0.0.0:4317" }
            http = { endpoint = "0.0.0.0:4318" }
          }
        }
      }
      processors = {
        batch = {
          timeout         = "1s"
          send_batch_size = 50
        }
        memory_limiter = {
          limit_mib       = 400
          spike_limit_mib = 100
          check_interval  = "5s"
        }
      }
      exporters = {
        awsxray = { region = var.region }
        awsemf  = {
          region                  = var.region
          namespace               = "EKS/Application"
          log_group_name          = "/aws/eks/application-metrics"
          dimension_rollup_option = "NoDimensionRollup"
        }
      }
      service = {
        pipelines = {
          traces  = { receivers = ["otlp"], processors = ["memory_limiter", "batch"], exporters = ["awsxray"] }
          metrics = { receivers = ["otlp"], processors = ["memory_limiter", "batch"], exporters = ["awsemf"] }
        }
      }
    }
  })]
}

Application Instrumentation — Python Example

# requirements.txt:
# opentelemetry-api, opentelemetry-sdk, opentelemetry-exporter-otlp-proto-grpc
# opentelemetry-instrumentation-fastapi, opentelemetry-instrumentation-requests

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import fastapi

otlp_exporter = OTLPSpanExporter(
    endpoint="http://adot-collector.amazon-cloudwatch.svc.cluster.local:4317",
    insecure=True
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

app = fastapi.FastAPI()

# Auto-instrument FastAPI — all routes traced automatically
FastAPIInstrumentor.instrument_app(app)
# Auto-instrument outbound HTTP — all requests calls traced
RequestsInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

@app.post("/api/v1/score")
async def score_transaction(transaction: dict):
    with tracer.start_as_current_span("score-transaction") as span:
        span.set_attribute("transaction.id", transaction["id"])
        span.set_attribute("transaction.amount", transaction["amount"])
        span.set_attribute("transaction.currency", transaction["currency"])

        result = await call_enrichment_service(transaction)

        span.set_attribute("risk.score", result["score"])
        span.set_attribute("risk.decision", result["decision"])

        return result

Kubernetes — ADOT Annotations

# Add to any pod spec to enable tracing — no code changes needed
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"
        instrumentation.opentelemetry.io/inject-python: "true"

6. CloudWatch Dashboards — Operational Visibility

The Platform Dashboard

resource "aws_cloudwatch_dashboard" "platform" {
  dashboard_name = "eks-platform-overview"

  dashboard_body = jsonencode({
    widgets = [
            # Row 1 — Cluster Health
      {
        type = "metric"
        properties = {
          title   = "Pod Restarts (All Namespaces)"
          region  = var.region
          metrics = [["ContainerInsights", "pod_number_of_container_restarts", "ClusterName", aws_eks_cluster.main.name]]
          view    = "timeSeries"
          period  = 300
          stat    = "Sum"
          annotations = {
            horizontal = [{ value = 5, label = "Alert threshold", color = "#d62728" }]
          }
        }
      },
      {
        type = "metric"
        properties = {
          title   = "Node CPU Utilisation"
          region  = var.region
          metrics = [["ContainerInsights", "node_cpu_utilization", "ClusterName", aws_eks_cluster.main.name]]
          view    = "timeSeries"
          period  = 60
          stat    = "Average"
        }
      },
            # Row 2 — Application Health
      {
        type = "metric"
        properties = {
          title   = "ALB P99 Latency (ms)"
          region  = var.region
          metrics = [
            ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", aws_lb.main.arn_suffix, { stat = "p99", period = 60, label = "P99" }],
            ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", aws_lb.main.arn_suffix, { stat = "p50", period = 60, label = "P50" }]
          ]
          view = "timeSeries"
        }
      },
      {
        type = "metric"
        properties = {
          title   = "ALB 5xx Error Rate"
          region  = var.region
          metrics = [
            [{ expression = "errors/requests*100", label = "Error Rate %", id = "rate" }],
            ["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "LoadBalancer", aws_lb.main.arn_suffix, { id = "errors", visible = false }],
            ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", aws_lb.main.arn_suffix, { id = "requests", visible = false }]
          ]
          view   = "timeSeries"
          period = 60
        }
      },
            # Row 3 — Cost
      {
        type = "metric"
        properties = {
          title   = "Karpenter Nodes Launched"
          region  = var.region
          metrics = [["Karpenter", "nodes_created_total", "NodePool", "spot-general"]]
          view    = "timeSeries"
          period  = 300
          stat    = "Sum"
        }
      }
    ]
  })
}

7. CloudWatch Synthetics — Proactive Endpoint Monitoring

Synthetics runs canary scripts on a schedule — simulating user requests to your endpoints from outside the cluster. If the canary fails, the alarm fires before any real user is affected.

# S3 bucket for canary scripts
resource "aws_s3_bucket" "canary_scripts" {
  bucket = "platform-canary-scripts-${var.account_id}"
}

# Canary — checks the scoring API every 5 minutes
resource "aws_synthetics_canary" "scoring_api" {
  name                 = "scoring-api-health"
  artifact_s3_location = "s3://${aws_s3_bucket.canary_scripts.bucket}/canary-results/"
  execution_role_arn   = aws_iam_role.canary.arn
  handler              = "apiCanary.handler"
  runtime_version      = "syn-nodejs-puppeteer-7.0"
  start_canary         = true

  schedule {
    expression = "rate(5 minutes)"
  }

  run_config {
    timeout_in_seconds = 30
    memory_in_mb       = 960
    active_tracing     = true   # Enable X-Ray tracing for canary runs
  }

  zip_file = data.archive_file.canary_script.output_base64sha256
}

// canary/apiCanary.js — runs every 5 minutes from CloudWatch
const synthetics = require('Synthetics');
const log        = require('SyntheticsLogger');

const apiCanary = async function () {
  const healthUrl = 'https://api.internal.company.com/health';

  const response = await synthetics.executeHttpStep(
    'Check health endpoint',
    healthUrl,
    { method: 'GET', headers: { 'x-canary': 'true' } }
  );

  if (response.statusCode !== 200) {
    throw new Error(`Health check failed: ${response.statusCode}`);
  }

  const scoreUrl  = 'https://api.internal.company.com/api/v1/score';
  const scoreBody = JSON.stringify({
    transaction_id: 'canary-test-001',
    amount:          1.00,
    currency:        'EUR'
  });

  const scoreResponse = await synthetics.executeHttpStep(
    'Check scoring endpoint',
    scoreUrl,
    {
      method:  'POST',
      headers: { 'Content-Type': 'application/json', 'x-canary': 'true' },
      body:    scoreBody
    }
  );

  if (scoreResponse.statusCode !== 200) {
    throw new Error(`Scoring endpoint failed: ${scoreResponse.statusCode}`);
  }

  const body = JSON.parse(scoreResponse.body);
  if (!body.decision) {
    throw new Error('Scoring response missing decision field');
  }

  log.info('Canary passed — API healthy');
};

exports.handler = async () => { await apiCanary(); };

# Alarm on canary failures
resource "aws_cloudwatch_metric_alarm" "canary_failure" {
  alarm_name          = "scoring-api-canary-failed"
  alarm_description   = "Scoring API canary check failed — endpoint may be down"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "Failed"
  namespace           = "CloudWatchSynthetics"
  period              = 300
  statistic           = "Sum"
  threshold           = 0

  dimensions = {
    CanaryName = aws_synthetics_canary.scoring_api.name
  }

  alarm_actions = [
    aws_sns_topic.platform_alerts.arn,
    aws_sns_topic.pagerduty.arn
  ]
}

8. What's Still Missing — The Honest Gaps

Every observability stack has gaps. Being honest about them is what separates operational maturity from a checkbox exercise.

Gap 1 — No Alerting on What You Don't Know About

Container Insights covers pod restarts, CPU, and memory. It does not cover:

Business metrics — transaction throughput, fraud detection rate, payment success rate. These require custom CloudWatch metrics emitted by your application.
Queue depth — SQS queue depth, Kafka consumer lag. These require separate alarms on AWS/SQS and custom metrics from your consumers.
Database connection pool exhaustion — RDS connection count approaching max_connections is invisible until queries start failing.

# Custom metric — application-level business signal
resource "aws_cloudwatch_metric_alarm" "sqs_queue_depth" {
  alarm_name          = "enrichment-queue-depth-high"
  alarm_description   = "SQS queue depth high — enrichment service may be falling behind"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 300
  statistic           = "Average"
  threshold           = 1000

  dimensions = {
    QueueName = aws_sqs_queue.enrichment_input.name
  }

  alarm_actions = [aws_sns_topic.platform_alerts.arn]
}

Gap 2 — No Correlation Between Logs and Traces

CloudWatch Logs and X-Ray traces are separate consoles. You cannot click from a log line to the trace that produced it — unless you embed the trace ID in your log output:

import logging
import json
from opentelemetry import trace

logger = logging.getLogger(__name__)

class TraceIdFilter(logging.Filter):
    """Injects the current OpenTelemetry trace ID into every log record."""
    def filter(self, record):
        span = trace.get_current_span()
        if span and span.is_recording():
            ctx = span.get_span_context()
            trace_id = format(ctx.trace_id, '032x')
            record.trace_id = f"1-{trace_id[:8]}-{trace_id[8:]}"
            record.span_id  = format(ctx.span_id, '016x')
        else:
            record.trace_id = "no-trace"
            record.span_id  = "no-span"
        return True

# Apply filter to root logger — every log line gets trace context
logging.getLogger().addFilter(TraceIdFilter())

# Structured JSON logging — makes log correlation queryable in CloudWatch Insights
logging.basicConfig(
    format=json.dumps({
        "timestamp": "%(asctime)s",
        "level":     "%(levelname)s",
        "message":   "%(message)s",
        "trace_id":  "%(trace_id)s",
        "span_id":   "%(span_id)s",
        "service":   "scoring-service",
        "namespace": "fraud-detection"
    })
)

def process_request(request_id: str):
    # trace_id is automatically injected into this log line
    logger.info(f"Processing request {request_id}")

Gap 3 — Alert Fatigue Without Runbooks

Alarms without runbooks create alert fatigue — engineers see the alarm, don't know what to do, and start ignoring alerts. Every alarm should link to a runbook:

resource "aws_cloudwatch_metric_alarm" "pod_restarts" {
  alarm_description = <<-EOT
    Pod restarting repeatedly — likely crash loop.
    Runbook: https://wiki.internal.company.com/runbooks/pod-crash-loop
    1. Check Container Insights for the specific pod
    2. Open CloudWatch Logs for the namespace
    3. Look for OOMKilled or configuration errors
    4. If config change, revert the last ArgoCD sync
  EOT
    # ... rest of alarm config
}

Gap 4 — No SLO Tracking

Alarms tell you when something is broken right now. SLOs (Service Level Objectives) tell you how often things were broken over time — and whether you're within your error budget. CloudWatch does not have native SLO tracking. Options:

AWS Application Signals — AWS-managed SLO framework, released 2024
Prometheus + Grafana — sloth or manual recording rules for SLO burn rate
Datadog / New Relic — if you have budget for a third-party observability platform

# AWS Application Signals — enables SLO tracking natively
resource "aws_applicationsignals_service_level_objective" "scoring_latency" {
  name        = "scoring-api-p99-latency"
  description = "P99 latency below 200ms — 99.5% of the time"

  sli = {
    sli_metric = {
      metric_data_queries = [{
        id          = "latency"
        metric_stat = {
          metric = {
            namespace   = "AWS/ApplicationELB"
            metric_name = "TargetResponseTime"
            dimensions  = [{ name = "LoadBalancer", value = aws_lb.main.arn_suffix }]
          }
          period = 60
          stat   = "p99"
        }
      }]
    }
    comparison_operator = "LessThan"
    metric_threshold    = 0.2   # 200ms
  }

  goal = {
    attainment_goal   = 99.5
    warning_threshold = 99.0
    interval = {
      rolling_interval = { duration = 7, duration_unit = "DAY" }
    }
  }
}

9. The Observability Maturity Model

Not every team needs every tool on day one. Here's the progression:

Level	What You Have	What You're Flying Blind On
Level 1	CloudWatch Logs, basic EC2 alarms	Pod-level failures, latency distribution
Level 2	+ Container Insights, ALB metrics, pod restart alarms	Request traces, business metrics
Level 3	+ X-Ray / ADOT tracing, log correlation	SLO tracking, user-facing quality
Level 4	+ Synthetics, custom business metrics, SLOs	Prediction, anomaly detection
Level 5	+ Anomaly detection, AIOps, full SLO budget tracking	Nothing — this is operational maturity

Start at Level 2 on Day 1. Container Insights + pod restart alarms + ALB metrics cover the majority of production incidents. Add tracing when you have more than 5 services and start seeing latency issues you cannot localise to one service. Add Synthetics when you have a public-facing API that must be monitored from the outside.

10. Common Mistakes & Anti-Patterns

Mistake 1: Log-Based Alarms for Pod Failures

Log-based alarms require a metric filter per log group per service. Every new service needs manual wiring. Use Container Insights pod_number_of_container_restarts — automatic coverage for every pod from the moment it starts.

Mistake 2: Alarms Without Multi-Channel Notification

A single SNS email subscription means the alarm fires at 2am and nobody sees it until morning. Always route critical alarms to at least two channels: Slack for visibility and PagerDuty/OpsGenie for on-call paging.

Mistake 3: No `treat_missing_data` Configuration

By default, an alarm with no data reports INSUFFICIENT_DATA — which does not trigger alarm actions. Set treat_missing_data = "breaching" for availability alarms (missing data = service is down) and "notBreaching" for load alarms (missing data = service is idle).

Mistake 4: Tracing Without Log Correlation

X-Ray traces and CloudWatch logs are in separate consoles. Without the trace ID embedded in log output, correlating a slow trace to the log lines that explain it requires manual timestamp matching. Embed the trace ID in structured log output from Day 1.

Mistake 5: Dashboard Without Annotations

A CloudWatch dashboard showing a metric spike is useful. A dashboard showing a metric spike with an annotation that says "Deployed v1.4.2 at 14:32" is invaluable. Automate deployment annotations via EventBridge → CloudWatch dashboard annotations.

Mistake 6: Alarms Without Runbooks

An alarm that fires at 2am and contains no guidance creates panic and delay. Every alarm description should contain: what it means, the immediate diagnostic steps, and a runbook link. This is free — it's a text field on the alarm resource.

Architecture Decision Matrix

Requirement	CloudWatch	X-Ray / ADOT	Prometheus + Grafana
Infrastructure metrics	✅ Native	❌ N/A	⚠️ Requires exporters
Pod / container metrics	✅ Container Insights	❌ N/A	✅ kube-state-metrics
Application logs	✅ Native	❌ N/A	❌ N/A
Distributed tracing	❌ N/A	✅ Native	⚠️ Jaeger / Tempo
Custom dashboards	⚠️ Limited	❌ N/A	✅ Best in class
Alerting	✅ Alarms	❌ N/A	✅ Alertmanager
AWS-native integration	✅ Deepest	✅ Deep	⚠️ Via exporters
Cost	✅ Pay per use	✅ Pay per trace	⚠️ EC2/node cost
SLO tracking	✅ App Signals	❌ N/A	✅ With sloth/recording rules
No infrastructure to manage	✅ Fully managed	✅ Fully managed	❌ You manage it

The Golden Rule

"Observability is not a tool you deploy — it is a property you build in from the start. Instrument before you deploy, not after the first incident. Use metric-based alarms over log-based alarms for pod failures — metrics cover new services automatically, log groups do not. Correlate metrics, logs, and traces so you can follow the evidence from alarm to root cause without guessing. And treat every alarm as a contract: if it fires, there must be a runbook, a responsible team, and a defined response time. An alarm nobody acts on is noise. Noise trains engineers to ignore alarms. Ignored alarms mean incidents go undetected. The goal is not more alarms — it is the right alarms, routing to the right people, with the context to act immediately."

Tags: #AWS #CloudWatch #Observability #XRay #OpenTelemetry #EKS #Kubernetes #DevSecOps #SRE #ContainerInsights #ADOT #Monitoring

Ankush Panday

Specializing in highly scalable AWS infrastructure and automated quality engineering.

Connect on LinkedIn