Monitoring and Observability
The AI Gateway exports telemetry through three channels: structured logs to CloudWatch Logs, distributed traces to X-Ray, and custom metrics via CloudWatch Embedded Metric Format (EMF). An ADOT sidecar container in each ECS task handles the export pipeline.
Telemetry Pipeline
flowchart LR
subgraph task["ECS Task"]
GW["agentgateway\nPort 8787"]
ADOT["ADOT Sidecar\nOTel Collector"]
end
subgraph cw["CloudWatch"]
LOG1["/ecs/ai-gateway/gateway\nContainer Logs"]
LOG2["/ecs/ai-gateway/otel\nCollector Logs"]
OTEL_LOG["/ecs/ai-gateway/otel-logs\nOTLP Logs"]
EMF["AIGateway Namespace\nEMF Metrics"]
DASH["CloudWatch Dashboard\n4 Widgets"]
end
XR["AWS X-Ray\nDistributed Traces"]
GW -->|"stdout/stderr\n(awslogs driver)"| LOG1
ADOT -->|"stdout/stderr\n(awslogs driver)"| LOG2
GW -->|"OTLP gRPC\nlocalhost:4317"| ADOT
ADOT -->|"awsxray exporter"| XR
ADOT -->|"awsemf exporter"| EMF
ADOT -->|"awscloudwatchlogs\nexporter"| OTEL_LOG
LOG1 --> DASH
EMF --> DASH
CloudWatch Log Groups
| Log Group | Source | Retention | Encryption |
|---|---|---|---|
| /ecs/ai-gateway/gateway | agentgateway container (structured JSON access log) | 365 days | KMS (alias/ai-gateway-logs) |
| /ecs/ai-gateway/otel | ADOT sidecar container operational logs | 365 days | KMS (alias/ai-gateway-logs) |
| /ecs/ai-gateway/otel-logs | OTLP logs exported by the collector pipeline | 365 days | KMS (alias/ai-gateway-logs) |
| /ecs/ai-gateway/metrics | EMF-formatted metrics from the collector | 365 days | KMS (alias/ai-gateway-logs) |
| aws-waf-logs-ai-gateway-{env} | WAF request logs (only when WAF is enabled) | 365 days | KMS (alias/ai-gateway-logs) |
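To confirm that the deployed log groups match the table, the retention and encryption settings can be read back with `aws logs describe-log-groups`. The sketch below is not part of the repository: the function name `audit_log_groups` is hypothetical, and it assumes AWS CLI v2 with credentials and region already configured. It flags any group under the `/ecs/ai-gateway` prefix whose retention is not 365 days or that has no KMS key attached. The API reports a key ARN rather than an alias, so the alias itself is not verified, and the WAF log group uses a different prefix and would need a second call.

```shell
#!/bin/sh
# Hypothetical helper: audit retention and KMS settings of the gateway log groups.
# Assumes AWS CLI v2 is installed and credentials/region are configured.

EXPECTED_RETENTION_DAYS=365

audit_log_groups() {
  prefix="${1:-/ecs/ai-gateway}"
  # --output text prints one tab-separated row per log group;
  # null fields (no retention, no KMS key) are printed as "None".
  aws logs describe-log-groups \
    --log-group-name-prefix "$prefix" \
    --query 'logGroups[].[logGroupName,retentionInDays,kmsKeyId]' \
    --output text |
  while IFS="$(printf '\t')" read -r name retention kms; do
    [ -n "$name" ] || continue
    issues=""
    [ "$retention" = "$EXPECTED_RETENTION_DAYS" ] || issues="$issues retention=$retention"
    case "$kms" in
      ""|None) issues="$issues no-kms-key" ;;
    esac
    if [ -z "$issues" ]; then
      printf 'OK\t%s\n' "$name"
    else
      # Strip the leading space left by the first appended issue.
      printf 'CHECK\t%s\t%s\n' "$name" "${issues# }"
    fi
  done
}
```

Calling `audit_log_groups` with no argument checks the default prefix; each output row starts with `OK` or `CHECK`, so the result can be filtered with `grep '^CHECK'`.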
OpenTelemetry Collector Configuration
The ADOT sidecar runs the AWS Distro for OpenTelemetry (public.ecr.aws/aws-observability/aws-otel-collector:latest) with the following pipeline configuration (defined in infrastructure/otel-config.yaml):
Receivers
| Receiver | Protocol | Endpoint |
|---|---|---|
| OTLP | gRPC | localhost:4317 |
| OTLP | HTTP | localhost:4318 |
Processors
| Processor | Configuration |
|---|---|
| memory_limiter | check_interval: 1s, limit_mib: 100 |
| batch | timeout: 5s, send_batch_size: 512 |
| resource | Upserts service.name = ai-gateway |
Exporters and Pipelines
| Pipeline | Processors | Exporter | Destination |
|---|---|---|---|
| Traces | memory_limiter, batch | awsxray | AWS X-Ray |
| Metrics | memory_limiter, batch | awsemf | CloudWatch Metrics (namespace: AIGateway, log group: /ecs/ai-gateway/metrics) |
| Logs | memory_limiter, batch | awscloudwatchlogs | CloudWatch Logs (/ecs/ai-gateway/otel-logs) |
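Taken together, the receiver, processor, and pipeline tables imply a collector configuration roughly like the one below. This is a reconstruction for orientation only, printed from a shell heredoc so it can be diffed against the real infrastructure/otel-config.yaml. The function name `print_otel_config_sketch` is hypothetical, the log stream name is a placeholder, and the actual file may set additional options or wire the processors differently.

```shell
#!/bin/sh
# Hypothetical helper: prints a collector config reconstructed from the tables above.
# It is a sketch for comparison, not a copy of infrastructure/otel-config.yaml.
print_otel_config_sketch() {
  cat <<'EOF'
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
      http:
        endpoint: localhost:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 100
  batch:
    timeout: 5s
    send_batch_size: 512
  # Defined per the Processors table. The pipeline table lists only
  # memory_limiter and batch, so it is not wired into the pipelines here.
  resource:
    attributes:
      - key: service.name
        value: ai-gateway
        action: upsert

exporters:
  awsxray:
  awsemf:
    namespace: AIGateway
    log_group_name: /ecs/ai-gateway/metrics
  awscloudwatchlogs:
    log_group_name: /ecs/ai-gateway/otel-logs
    log_stream_name: otel-logs  # placeholder; the real stream name is an assumption

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [awsemf]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [awscloudwatchlogs]
EOF
}
```

Running `print_otel_config_sketch | diff - infrastructure/otel-config.yaml` from the repository root would show where the deployed configuration departs from this reading of the tables.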
Saved CloudWatch Logs Insights Queries
Pre-built queries are deployed as CloudWatch saved queries that target the gateway log group. Each one reads the structured JSON access log that agentgateway emits, in which provider and model are flat fields re-keyed by the config’s accessLog.add map.
1. Requests per Hour by Provider
Saved as: ai-gateway/requests-per-hour-by-provider
    fields @timestamp, @message
    | filter ispresent(responseTime)
    | stats count(*) as requests by bin(1h), provider
    | sort bin(1h) desc
2. Error Rate by Provider
Saved as: ai-gateway/error-rate-by-provider
    fields @timestamp, @message
    | filter ispresent(res.statusCode)
    | stats count(*) as total, sum(res.statusCode >= 400) as errors, (sum(res.statusCode >= 400) / count(*)) * 100 as error_pct by provider
    | sort error_pct desc
3. Latency Percentiles by Provider
Saved as: ai-gateway/latency-percentiles-by-provider
    fields @timestamp, responseTime, provider, model
    | filter ispresent(responseTime)
    | stats pct(responseTime, 50) as p50, pct(responseTime, 95) as p95, pct(responseTime, 99) as p99, avg(responseTime) as avg_ms by provider, model
    | sort p99 desc
4. Requests by Endpoint
Saved as: ai-gateway/requests-by-endpoint
    fields @timestamp, req.url
    | filter ispresent(req.url)
    | stats count(*) as requests by `req.url` as endpoint
    | sort requests desc
    | limit 20
CloudWatch Dashboard
The dashboard ai-gateway-{environment} contains 4 widgets arranged in a 2x2 grid:
| Position | Widget | Type | Data Source |
|---|---|---|---|
| Top-left | Requests per Hour by Provider | Time series | Gateway log group (Logs Insights) |
| Top-right | Error Rate by Provider | Table | Gateway log group (Logs Insights) |
| Bottom-left | Latency Percentiles by Provider (ms) | Table | Gateway log group (Logs Insights) |
| Bottom-right | Top Endpoints by Request Count | Table | Gateway log group (Logs Insights) |
Running Queries via CLI
The scripts/cw-queries.sh script provides a convenient way to run the saved queries from the command line:
    # Run all 4 queries (default: last 1 hour)
    ./scripts/cw-queries.sh

    # Run a specific query
    ./scripts/cw-queries.sh requests
    ./scripts/cw-queries.sh errors
    ./scripts/cw-queries.sh latency
    ./scripts/cw-queries.sh endpoints
Environment Variables
| Variable | Default | Description |
|---|---|---|
| LOG_GROUP | /ecs/ai-gateway/gateway | Target CloudWatch log group |
| START_TIME | 1 hour ago (epoch seconds) | Query start time |
| END_TIME | Now (epoch seconds) | Query end time |
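These three variables map directly onto the parameters of `aws logs start-query`. The sketch below shows one way a runner could apply the defaults and poll for results; it is an illustration, not the contents of scripts/cw-queries.sh. The function names `hours_ago_epoch` and `run_insights_query` and the `POLL_INTERVAL` variable are hypothetical, and AWS CLI v2 with configured credentials is assumed.

```shell
#!/bin/sh
# Hypothetical sketch of a Logs Insights runner; NOT the contents of scripts/cw-queries.sh.
# Assumes AWS CLI v2 with credentials and region already configured.

# Epoch seconds for "N hours before now". Plain arithmetic avoids GNU-only
# `date -d`, so it also works with BSD/macOS date.
# The optional second argument overrides "now" (useful for testing).
hours_ago_epoch() {
  hours="$1"
  now="${2:-$(date +%s)}"
  echo $(( now - hours * 3600 ))
}

run_insights_query() {
  query_string="$1"
  # Apply the documented defaults when the variables are unset.
  log_group="${LOG_GROUP:-/ecs/ai-gateway/gateway}"
  start_time="${START_TIME:-$(hours_ago_epoch 1)}"
  end_time="${END_TIME:-$(date +%s)}"

  query_id="$(aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start_time" \
    --end-time "$end_time" \
    --query-string "$query_string" \
    --query queryId --output text)" || return 1

  # Poll until the query leaves the Scheduled/Running states.
  while :; do
    status="$(aws logs get-query-results --query-id "$query_id" \
      --query status --output text)" || return 1
    case "$status" in
      Complete) break ;;
      Scheduled|Running) sleep "${POLL_INTERVAL:-1}" ;;
      *) echo "query $query_id ended with status $status" >&2; return 1 ;;
    esac
  done

  # Print the result rows as JSON.
  aws logs get-query-results --query-id "$query_id" --query results --output json
}
```

For example, `START_TIME=$(hours_ago_epoch 24) run_insights_query 'fields @timestamp, provider | limit 5'` would query the last 24 hours of the default log group. The same helper can feed the real script on systems without GNU date: `START_TIME=$(hours_ago_epoch 24) ./scripts/cw-queries.sh`.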
Examples
    # Query the last 24 hours
    START_TIME=$(date -d '24 hours ago' +%s) ./scripts/cw-queries.sh

    # Query a different log group
    LOG_GROUP=/ecs/ai-gateway/otel ./scripts/cw-queries.sh

    # Query a specific time range
    START_TIME=1711000000 END_TIME=1711003600 ./scripts/cw-queries.sh errors
Key Metrics to Watch
Operational Health
| Metric | Source | Healthy Range | Action if Breached |
|---|---|---|---|
| Request rate | Logs Insights (requests/hour) | Baseline +/- 50% | Investigate traffic spikes; check if autoscaling is responding |
| Error rate (4xx + 5xx) | Logs Insights (error_pct) | < 5% | Check provider API status; review error logs for patterns |
| p50 latency | Logs Insights (latency_percentiles) | < 500ms | Normal range varies by model; investigate if it suddenly increases |
| p99 latency | Logs Insights (latency_percentiles) | < 5000ms | May indicate provider throttling or network issues |
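The fixed thresholds in this table are simple enough to check mechanically. The sketch below is a hypothetical helper (the name `check_operational_health` is not from the repository) that takes an error percentage and the p50 and p99 latencies in milliseconds, as reported by the saved queries, and prints one line per breached threshold. The request-rate row is left out because its baseline is deployment-specific.

```shell
#!/bin/sh
# Hypothetical helper: compare observed values against the healthy ranges above.
# Arguments: error_pct p50_ms p99_ms (as reported by the Logs Insights queries).
# Exit status is 0 when all values are in range, 1 when any threshold is breached.
check_operational_health() {
  awk -v err="$1" -v p50="$2" -v p99="$3" 'BEGIN {
    breaches = 0
    # "+ 0" forces a numeric comparison even if a value arrives as a string.
    if (err + 0 >= 5)    { print "error rate " err "% is not below 5%"; breaches++ }
    if (p50 + 0 >= 500)  { print "p50 latency " p50 "ms is not below 500ms"; breaches++ }
    if (p99 + 0 >= 5000) { print "p99 latency " p99 "ms is not below 5000ms"; breaches++ }
    exit (breaches > 0)
  }'
}
```

A call such as `check_operational_health 6.5 310 7200` reports the error-rate and p99 breaches and returns a non-zero status, which makes it usable as a gate in a scheduled job.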
Infrastructure Health
| Metric | Source | Healthy Range | Action if Breached |
|---|---|---|---|
| ECS running task count | CloudWatch ECS metrics | >= autoscaling_min_capacity | Check ECS events for task failures; verify health checks |
| CPU utilization | CloudWatch ECS metrics | < 70% (autoscaling target) | Autoscaling should handle; increase autoscaling_max_capacity if at limit |
| ALB request count | CloudWatch ALB metrics | < 500/target (autoscaling target) | Autoscaling should handle; review if tasks are scaling appropriately |
| ALB 5xx count | CloudWatch ALB metrics | 0 | Check ECS task health; review gateway logs for crashes |
| WAF blocked requests | CloudWatch WAF metrics | Low, non-zero | Review WAF logs for false positives; adjust rules if legitimate traffic is blocked |
Where to Look
| Signal | First Check | Deep Dive |
|---|---|---|
| High error rate | Dashboard “Error Rate by Provider” widget | Logs Insights: filter by status code and provider |
| Slow responses | Dashboard “Latency Percentiles” widget | X-Ray traces: look for slow spans |
| No traffic | ALB target group health in ECS console | ECS task events and container health checks |
| Task crashes | ECS service events tab | Gateway log group: look for fatal/error level logs |
| WAF blocking | WAF metrics in CloudWatch | WAF log group: aws-waf-logs-ai-gateway-{env} |