Monitoring and Observability

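The AI Gateway exports telemetry through three channels: structured logs to CloudWatch Logs, distributed traces to X-Ray, and custom metrics via CloudWatch Embedded Metric Format (EMF). An AWS Distro for OpenTelemetry (ADOT) sidecar container in each ECS task receives OTLP data from the gateway and exports traces, metrics, and OTLP logs. Container stdout/stderr reaches CloudWatch Logs separately, through the awslogs driver.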

Telemetry Pipeline

flowchart LR
    subgraph task["ECS Task"]
        GW["agentgateway\nPort 8787"]
        ADOT["ADOT Sidecar\nOTel Collector"]
    end

    subgraph cw["CloudWatch"]
        LOG1["/ecs/ai-gateway/gateway\nContainer Logs"]
        LOG2["/ecs/ai-gateway/otel\nCollector Logs"]
        OTEL_LOG["/ecs/ai-gateway/otel-logs\nOTLP Logs"]
        EMF["AIGateway Namespace\nEMF Metrics"]
        DASH["CloudWatch Dashboard\n4 Widgets"]
    end

    XR["AWS X-Ray\nDistributed Traces"]

    GW -->|"stdout/stderr\n(awslogs driver)"| LOG1
    ADOT -->|"stdout/stderr\n(awslogs driver)"| LOG2
    GW -->|"OTLP gRPC\nlocalhost:4317"| ADOT
    ADOT -->|"awsxray exporter"| XR
    ADOT -->|"awsemf exporter"| EMF
    ADOT -->|"awscloudwatchlogs\nexporter"| OTEL_LOG
    LOG1 --> DASH
    EMF --> DASH

CloudWatch Log Groups

Log Group	Source	Retention	Encryption
`/ecs/ai-gateway/gateway`	agentgateway container (structured JSON access log)	365 days	KMS (`alias/ai-gateway-logs`)
`/ecs/ai-gateway/otel`	ADOT sidecar container operational logs	365 days	KMS (`alias/ai-gateway-logs`)
`/ecs/ai-gateway/otel-logs`	OTLP logs exported by the collector pipeline	365 days	KMS (`alias/ai-gateway-logs`)
`/ecs/ai-gateway/metrics`	EMF-formatted metrics from the collector	365 days	KMS (`alias/ai-gateway-logs`)
`aws-waf-logs-ai-gateway-{env}`	WAF request logs (only when WAF is enabled)	365 days	KMS (`alias/ai-gateway-logs`)
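The retention values in the table can be spot-checked from the command line. The following is a minimal sketch built on `aws logs describe-log-groups`; the function names `expected_log_groups` and `check_log_groups` are hypothetical (they are not part of the repository), and the environment name passed in is an assumption.

```shell
# Sketch: compare each log group's retention against the 365 days stated above.
# Function names are illustrative; only the log group names come from the table.

expected_log_groups() {
  # $1 = deployment environment used in the WAF log group name (assumed value)
  printf '%s\n' \
    /ecs/ai-gateway/gateway \
    /ecs/ai-gateway/otel \
    /ecs/ai-gateway/otel-logs \
    /ecs/ai-gateway/metrics \
    "aws-waf-logs-ai-gateway-$1"
}

check_log_groups() {
  for group in $(expected_log_groups "$1"); do
    # The prefix filter can match several groups, so select the exact name.
    days=$(aws logs describe-log-groups \
      --log-group-name-prefix "$group" \
      --query "logGroups[?logGroupName=='${group}'].retentionInDays | [0]" \
      --output text)
    if [ "$days" = "365" ]; then
      echo "OK       $group"
    else
      echo "MISMATCH $group (retentionInDays=$days)"
    fi
  done
}
```

Calling `check_log_groups dev` prints one line per group. The WAF group is expected to show as a mismatch (`None`) when WAF is disabled, since it only exists when WAF is enabled.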

OpenTelemetry Collector Configuration

The ADOT sidecar runs the AWS Distro for OpenTelemetry Collector image (public.ecr.aws/aws-observability/aws-otel-collector:latest). Its pipeline configuration, defined in infrastructure/otel-config.yaml, is summarized below:

Receivers

Receiver	Protocol	Endpoint
OTLP	gRPC	`localhost:4317`
OTLP	HTTP	`localhost:4318`

Processors

Processor	Configuration
`memory_limiter`	`check_interval: 1s`, `limit_mib: 100`
`batch`	`timeout: 5s`, `send_batch_size: 512`
`resource`	Upserts `service.name = ai-gateway`

Exporters and Pipelines

Pipeline	Processors	Exporter	Destination
Traces	memory_limiter, batch	`awsxray`	AWS X-Ray
Metrics	memory_limiter, batch	`awsemf`	CloudWatch Metrics (namespace: `AIGateway`, log group: `/ecs/ai-gateway/metrics`)
Logs	memory_limiter, batch	`awscloudwatchlogs`	CloudWatch Logs (`/ecs/ai-gateway/otel-logs`)
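To make the three tables above easier to read as a whole, the sketch below writes a collector configuration that mirrors them. It is a reconstruction from the tables, not the contents of infrastructure/otel-config.yaml. The function name `write_otel_sketch`, the output file name, and the `log_stream_name` value are assumptions introduced for illustration.

```shell
# Sketch: write a collector config that mirrors the receiver, processor,
# and pipeline tables. NOT the real infrastructure/otel-config.yaml.
write_otel_sketch() {
  out="${1:-otel-config.sketch.yaml}"   # hypothetical output path
  cat > "$out" <<'EOF'
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
      http:
        endpoint: localhost:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 100
  batch:
    timeout: 5s
    send_batch_size: 512
  # Defined per the processor table; the pipeline table does not list it,
  # so it is left unattached here.
  resource:
    attributes:
      - key: service.name
        value: ai-gateway
        action: upsert

exporters:
  awsxray: {}
  awsemf:
    namespace: AIGateway
    log_group_name: /ecs/ai-gateway/metrics
  awscloudwatchlogs:
    log_group_name: /ecs/ai-gateway/otel-logs
    log_stream_name: otel-logs   # assumed; the stream name is not documented

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [awsemf]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [awscloudwatchlogs]
EOF
  echo "$out"
}
```

Running `write_otel_sketch` produces a file that can be diffed against the real configuration to spot where the deployed pipelines differ from this summary.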

Saved CloudWatch Logs Insights Queries

Four pre-built queries are deployed as CloudWatch Logs Insights saved queries that target the gateway log group (/ecs/ai-gateway/gateway). Each one reads the structured JSON access log that agentgateway emits, in which provider and model are flat fields re-keyed by the config’s accessLog.add map.
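Because the saved queries share the `ai-gateway/` name prefix, they can be listed and inspected with `aws logs describe-query-definitions`. The helper names `list_saved_queries` and `saved_query_string` below are hypothetical and shown only as a sketch.

```shell
# Sketch: list saved query names under the ai-gateway/ prefix, one per line.
list_saved_queries() {
  aws logs describe-query-definitions \
    --query-definition-name-prefix "ai-gateway/" \
    --query 'queryDefinitions[].name' \
    --output text | tr '\t' '\n' | sort
}

# Sketch: print the Logs Insights query text of one saved query.
# $1 = full saved query name, e.g. ai-gateway/error-rate-by-provider
saved_query_string() {
  aws logs describe-query-definitions \
    --query-definition-name-prefix "$1" \
    --query "queryDefinitions[?name=='$1'].queryString | [0]" \
    --output text
}
```

For example, `saved_query_string ai-gateway/error-rate-by-provider` prints the deployed query text, which is useful for confirming that it matches the version documented here.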

1. Requests per Hour by Provider

Saved as: ai-gateway/requests-per-hour-by-provider

fields @timestamp, @message
| filter ispresent(responseTime)
| stats count(*) as requests by bin(1h), provider
| sort bin(1h) desc

2. Error Rate by Provider

Saved as: ai-gateway/error-rate-by-provider

fields @timestamp, @message
| filter ispresent(res.statusCode)
| stats count(*) as total,
        sum(res.statusCode >= 400) as errors,
        (sum(res.statusCode >= 400) / count(*)) * 100 as error_pct
  by provider
| sort error_pct desc

3. Latency Percentiles by Provider

Saved as: ai-gateway/latency-percentiles-by-provider

fields @timestamp, responseTime, provider, model
| filter ispresent(responseTime)
| stats pct(responseTime, 50) as p50,
        pct(responseTime, 95) as p95,
        pct(responseTime, 99) as p99,
        avg(responseTime) as avg_ms
  by provider, model
| sort p99 desc

4. Requests by Endpoint

Saved as: ai-gateway/requests-by-endpoint

fields @timestamp, req.url
| filter ispresent(req.url)
| stats count(*) as requests by `req.url` as endpoint
| sort requests desc
| limit 20

CloudWatch Dashboard

The dashboard ai-gateway-{environment} contains 4 widgets arranged in a 2x2 grid:

Position	Widget	Type	Data Source
Top-left	Requests per Hour by Provider	Time series	Gateway log group (Logs Insights)
Top-right	Error Rate by Provider	Table	Gateway log group (Logs Insights)
Bottom-left	Latency Percentiles by Provider (ms)	Table	Gateway log group (Logs Insights)
Bottom-right	Top Endpoints by Request Count	Table	Gateway log group (Logs Insights)

Running Queries via CLI

The scripts/cw-queries.sh script runs the saved queries from the command line, either all four at once or one at a time:

# Run all 4 queries (default: last 1 hour)
./scripts/cw-queries.sh

# Run a specific query
./scripts/cw-queries.sh requests
./scripts/cw-queries.sh errors
./scripts/cw-queries.sh latency
./scripts/cw-queries.sh endpoints

Environment Variables

Variable	Default	Description
`LOG_GROUP`	`/ecs/ai-gateway/gateway`	Target CloudWatch log group
`START_TIME`	1 hour ago (epoch seconds)	Query start time
`END_TIME`	Now (epoch seconds)	Query end time
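The sketch below shows how these three variables could feed a Logs Insights query through the AWS CLI. It is an illustration of the general `start-query` / `get-query-results` flow, not code taken from scripts/cw-queries.sh; the function name `run_insights_query` is hypothetical.

```shell
# Sketch: apply the documented defaults, then run one query and wait for it.
now=$(date +%s)
LOG_GROUP="${LOG_GROUP:-/ecs/ai-gateway/gateway}"
START_TIME="${START_TIME:-$((now - 3600))}"   # default: 1 hour ago
END_TIME="${END_TIME:-$now}"                  # default: now

run_insights_query() {
  # $1 = Logs Insights query string
  query_id=$(aws logs start-query \
    --log-group-name "$LOG_GROUP" \
    --start-time "$START_TIME" \
    --end-time "$END_TIME" \
    --query-string "$1" \
    --query queryId --output text) || return 1

  # Poll until the query leaves the Scheduled/Running states.
  while :; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query status --output text) || return 1
    case "$status" in
      Scheduled|Running) sleep 1 ;;
      *) break ;;
    esac
  done

  if [ "$status" != "Complete" ]; then
    echo "query ended with status: $status" >&2
    return 1
  fi
  aws logs get-query-results --query-id "$query_id" --output table
}
```

A call such as `run_insights_query 'fields @timestamp, provider | limit 5'` returns the result table once the query completes.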

Examples

# Query the last 24 hours (GNU date syntax; on macOS/BSD use: date -v-24H +%s)
START_TIME=$(date -d '24 hours ago' +%s) ./scripts/cw-queries.sh

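# Query a different log group (the queries filter on gateway access-log fields,
# so groups without those fields may return no rows)
LOG_GROUP=/ecs/ai-gateway/otel ./scripts/cw-queries.sh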

# Query a specific time range
START_TIME=1711000000 END_TIME=1711003600 ./scripts/cw-queries.sh errors
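Since `START_TIME` and `END_TIME` are plain epoch seconds, a relative window can also be computed with shell arithmetic, which avoids the differences between GNU and BSD `date`. The helper name `hours_ago` is hypothetical and shown only as a sketch.

```shell
# Sketch: print the epoch timestamp N hours before now using only
# `date +%s` and shell arithmetic, so it behaves the same on Linux and macOS.
hours_ago() {
  echo $(( $(date +%s) - $1 * 3600 ))
}
```

After defining the function in the current shell, `START_TIME=$(hours_ago 24) ./scripts/cw-queries.sh` queries the last 24 hours.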

Key Metrics to Watch

Operational Health

Metric	Source	Healthy Range	Action if Breached
Request rate	Logs Insights (requests/hour)	Baseline +/- 50%	Investigate traffic spikes; check if autoscaling is responding
Error rate (4xx + 5xx)	Logs Insights (error_pct)	< 5%	Check provider API status; review error logs for patterns
p50 latency	Logs Insights (latency_percentiles)	< 500ms	Normal range varies by model; investigate sudden increases
p99 latency	Logs Insights (latency_percentiles)	< 5000ms	May indicate provider throttling or network issues
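As a worked example of the error-rate threshold, the sketch below recomputes `error_pct` from the `total` and `errors` columns of the Error Rate by Provider query and compares it with the 5% limit. The function names are hypothetical, and the sample numbers in the usage note are invented for illustration.

```shell
# Sketch: error_pct TOTAL ERRORS prints the percentage with two decimals.
error_pct() {
  awk -v total="$1" -v errors="$2" 'BEGIN {
    if (total == 0) { print "0.00"; exit }
    printf "%.2f\n", (errors / total) * 100
  }'
}

# Sketch: succeeds (exit 0) when the error rate is below the limit.
# $3 is an optional limit in percent; it defaults to the 5% from the table.
error_rate_healthy() {
  awk -v pct="$(error_pct "$1" "$2")" -v limit="${3:-5}" \
    'BEGIN { exit !(pct + 0 < limit + 0) }'
}
```

With 1000 requests and 37 errors, `error_pct 1000 37` prints `3.70` and `error_rate_healthy 1000 37` succeeds. With 200 requests and 10 errors the rate is exactly 5.00, which is not below the limit, so the check fails.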

Infrastructure Health

Metric	Source	Healthy Range	Action if Breached
ECS running task count	CloudWatch ECS metrics	>= `autoscaling_min_capacity`	Check ECS events for task failures; verify health checks
CPU utilization	CloudWatch ECS metrics	< 70% (autoscaling target)	Autoscaling should absorb this; raise `autoscaling_max_capacity` if the service is already at its limit
ALB request count	CloudWatch ALB metrics	< 500/target (autoscaling target)	Autoscaling should absorb this; check whether tasks are actually scaling out
ALB 5xx count	CloudWatch ALB metrics	0	Check ECS task health; review gateway logs for crashes
WAF blocked requests	CloudWatch WAF metrics	Low, non-zero	Review WAF logs for false positives; adjust rules if legitimate traffic is blocked
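The CPU figure in the table can be read directly from CloudWatch. The sketch below uses `aws cloudwatch get-metric-statistics` against the standard `AWS/ECS` `CPUUtilization` metric; the function names are hypothetical, and the cluster and service names are arguments because this document does not state them.

```shell
# Sketch: print the most recent 5-minute average CPU utilization for a service.
# $1 = ECS cluster name, $2 = ECS service name (both supplied by the caller)
ecs_cpu_avg() {
  now=$(date +%s)
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ECS \
    --metric-name CPUUtilization \
    --dimensions "Name=ClusterName,Value=$1" "Name=ServiceName,Value=$2" \
    --start-time "$((now - 600))" \
    --end-time "$now" \
    --period 300 \
    --statistics Average \
    --query 'sort_by(Datapoints, &Timestamp)[-1].Average' \
    --output text
}

# Sketch: classify a CPU value against the 70% autoscaling target.
cpu_status() {
  case "$1" in
    ''|None) echo "no-data"; return ;;
  esac
  awk -v cpu="$1" -v target=70 'BEGIN {
    if (cpu + 0 < target) print "ok"; else print "at-or-above-target"
  }'
}
```

For example, `cpu_status "$(ecs_cpu_avg my-cluster my-service)"` prints `ok`, `at-or-above-target`, or `no-data` when CloudWatch returns no datapoints for the window. The names `my-cluster` and `my-service` are placeholders.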

Where to Look

Signal	First Check	Deep Dive
High error rate	Dashboard “Error Rate by Provider” widget	Logs Insights: filter by status code and provider
Slow responses	Dashboard “Latency Percentiles” widget	X-Ray traces: look for slow spans
No traffic	ALB target group health in ECS console	ECS task events and container health checks
Task crashes	ECS service events tab	Gateway log group: look for fatal/error level logs
WAF blocking	WAF metrics in CloudWatch	WAF log group: `aws-waf-logs-ai-gateway-{env}`