
# Monitoring and Observability

The AI Gateway exports telemetry through three channels: structured logs to CloudWatch Logs, distributed traces to X-Ray, and custom metrics via CloudWatch Embedded Metric Format (EMF). An ADOT sidecar container in each ECS task handles the export pipeline.

```mermaid
flowchart LR
    subgraph task["ECS Task"]
        GW["Portkey Gateway\nPort 8787"]
        ADOT["ADOT Sidecar\nOTel Collector"]
    end

    subgraph cw["CloudWatch"]
        LOG1["/ecs/ai-gateway/gateway\nContainer Logs"]
        LOG2["/ecs/ai-gateway/otel\nCollector Logs"]
        OTEL_LOG["/ecs/ai-gateway/otel-logs\nOTLP Logs"]
        EMF["AIGateway Namespace\nEMF Metrics"]
        DASH["CloudWatch Dashboard\n4 Widgets"]
    end

    XR["AWS X-Ray\nDistributed Traces"]

    GW -->|"stdout/stderr\n(awslogs driver)"| LOG1
    ADOT -->|"stdout/stderr\n(awslogs driver)"| LOG2
    GW -->|"OTLP gRPC\nlocalhost:4317"| ADOT
    ADOT -->|"awsxray exporter"| XR
    ADOT -->|"awsemf exporter"| EMF
    ADOT -->|"awscloudwatchlogs\nexporter"| OTEL_LOG
    LOG1 --> DASH
    EMF --> DASH
```
| Log Group | Source | Retention | Encryption |
| --- | --- | --- | --- |
| `/ecs/ai-gateway/gateway` | Portkey gateway container (pino JSON via Fastify) | 365 days | KMS (`alias/ai-gateway-logs`) |
| `/ecs/ai-gateway/otel` | ADOT sidecar container operational logs | 365 days | KMS (`alias/ai-gateway-logs`) |
| `/ecs/ai-gateway/otel-logs` | OTLP logs exported by the collector pipeline | 365 days | KMS (`alias/ai-gateway-logs`) |
| `/ecs/ai-gateway/metrics` | EMF-formatted metrics from the collector | 365 days | KMS (`alias/ai-gateway-logs`) |
| `aws-waf-logs-ai-gateway-{env}` | WAF request logs (only when WAF is enabled) | 365 days | KMS (`alias/ai-gateway-logs`) |

The ADOT sidecar runs the AWS Distro for OpenTelemetry collector (`public.ecr.aws/aws-observability/aws-otel-collector:latest`) with the following pipeline configuration, defined in `infrastructure/otel-config.yaml`:

| Receiver | Protocol | Endpoint |
| --- | --- | --- |
| OTLP | gRPC | `localhost:4317` |
| OTLP | HTTP | `localhost:4318` |

| Processor | Configuration |
| --- | --- |
| `memory_limiter` | `check_interval: 1s`, `limit_mib: 100` |
| `batch` | `timeout: 5s`, `send_batch_size: 512` |
| `resource` | Upserts `service.name = ai-gateway` |

| Pipeline | Processors | Exporter | Destination |
| --- | --- | --- | --- |
| Traces | `memory_limiter`, `batch` | `awsxray` | AWS X-Ray |
| Metrics | `memory_limiter`, `batch` | `awsemf` | CloudWatch Metrics (namespace `AIGateway`, log group `/ecs/ai-gateway/metrics`) |
| Logs | `memory_limiter`, `batch` | `awscloudwatchlogs` | CloudWatch Logs (`/ecs/ai-gateway/otel-logs`) |
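A collector configuration matching the tables above would look roughly like this. This is a sketch reconstructed from the tables, not the verbatim contents of `infrastructure/otel-config.yaml`; the `log_stream_name` on the `awscloudwatchlogs` exporter is an assumed placeholder, and the `resource` processor is defined but (per the pipeline table) not attached to any pipeline:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
      http:
        endpoint: localhost:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 100
  batch:
    timeout: 5s
    send_batch_size: 512
  resource:
    attributes:
      - key: service.name
        value: ai-gateway
        action: upsert

exporters:
  awsxray: {}
  awsemf:
    namespace: AIGateway
    log_group_name: /ecs/ai-gateway/metrics
  awscloudwatchlogs:
    log_group_name: /ecs/ai-gateway/otel-logs
    log_stream_name: otel-logs  # assumed; the real stream name is not documented above

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [awsemf]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [awscloudwatchlogs]
```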

Four pre-built Logs Insights queries are deployed as CloudWatch saved queries targeting the gateway log group. All four parse the pino JSON logs emitted by the Portkey Fastify server.

Saved as: `ai-gateway/requests-per-hour-by-provider`

```
fields @timestamp, @message
| filter ispresent(responseTime)
| stats count(*) as requests by bin(1h), `req.headers.x-portkey-provider` as provider
| sort bin(1h) desc
```

Saved as: `ai-gateway/error-rate-by-provider`

```
fields @timestamp, @message
| filter ispresent(res.statusCode)
| stats count(*) as total,
    sum(res.statusCode >= 400) as errors,
    (sum(res.statusCode >= 400) / count(*)) * 100 as error_pct
    by `req.headers.x-portkey-provider` as provider
| sort error_pct desc
```

Saved as: `ai-gateway/latency-percentiles-by-provider`

```
fields @timestamp, responseTime
| filter ispresent(responseTime)
| stats pct(responseTime, 50) as p50,
    pct(responseTime, 95) as p95,
    pct(responseTime, 99) as p99,
    avg(responseTime) as avg_ms
    by `req.headers.x-portkey-provider` as provider
| sort p99 desc
```

Saved as: `ai-gateway/requests-by-endpoint`

```
fields @timestamp, req.url
| filter ispresent(req.url)
| stats count(*) as requests by `req.url` as endpoint
| sort requests desc
| limit 20
```

The dashboard `ai-gateway-{environment}` contains four widgets arranged in a 2×2 grid:

| Position | Widget | Type | Data Source |
| --- | --- | --- | --- |
| Top-left | Requests per Hour by Provider | Time series | Gateway log group (Logs Insights) |
| Top-right | Error Rate by Provider | Table | Gateway log group (Logs Insights) |
| Bottom-left | Latency Percentiles by Provider (ms) | Table | Gateway log group (Logs Insights) |
| Bottom-right | Top Endpoints by Request Count | Table | Gateway log group (Logs Insights) |
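A Logs Insights-backed widget in a CloudWatch dashboard body embeds its query string with a `SOURCE` prefix naming the log group. As a sketch of what the top-left widget's definition might look like (the region, coordinates, and sizes here are assumptions; the query is the saved requests-per-hour query from above):

```json
{
  "type": "log",
  "x": 0,
  "y": 0,
  "width": 12,
  "height": 6,
  "properties": {
    "region": "us-east-1",
    "title": "Requests per Hour by Provider",
    "view": "timeSeries",
    "query": "SOURCE '/ecs/ai-gateway/gateway' | fields @timestamp, @message | filter ispresent(responseTime) | stats count(*) as requests by bin(1h), `req.headers.x-portkey-provider` as provider | sort bin(1h) desc"
  }
}
```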

The `scripts/cw-queries.sh` script provides a convenient way to run the saved queries from the command line:

```sh
# Run all 4 queries (default: last 1 hour)
./scripts/cw-queries.sh

# Run a specific query
./scripts/cw-queries.sh requests
./scripts/cw-queries.sh errors
./scripts/cw-queries.sh latency
./scripts/cw-queries.sh endpoints
```
The script honors three environment variable overrides:

| Variable | Default | Description |
| --- | --- | --- |
| `LOG_GROUP` | `/ecs/ai-gateway/gateway` | Target CloudWatch log group |
| `START_TIME` | 1 hour ago (epoch seconds) | Query start time |
| `END_TIME` | Now (epoch seconds) | Query end time |
```sh
# Query the last 24 hours
START_TIME=$(date -d '24 hours ago' +%s) ./scripts/cw-queries.sh

# Query a different log group
LOG_GROUP=/ecs/ai-gateway/otel ./scripts/cw-queries.sh

# Query a specific time range
START_TIME=1711000000 END_TIME=1711003600 ./scripts/cw-queries.sh errors
```
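The defaults in the table can be reproduced with ordinary shell parameter expansion. A minimal sketch of the resolution the script presumably performs (the logic itself is an assumption; `date -d` is GNU coreutils syntax):

```sh
#!/bin/sh
# Resolve the same defaults the table documents (hypothetical sketch).
LOG_GROUP="${LOG_GROUP:-/ecs/ai-gateway/gateway}"        # target log group
START_TIME="${START_TIME:-$(date -d '1 hour ago' +%s)}"  # 1 hour ago, epoch seconds (GNU date)
END_TIME="${END_TIME:-$(date +%s)}"                      # now, epoch seconds

echo "Querying $LOG_GROUP from $START_TIME to $END_TIME"
```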
Request-level metrics (derived from the Logs Insights queries):

| Metric | Source | Healthy Range | Action if Breached |
| --- | --- | --- | --- |
| Request rate | Logs Insights (requests/hour) | Baseline +/- 50% | Investigate traffic spikes; check if autoscaling is responding |
| Error rate (4xx + 5xx) | Logs Insights (error_pct) | < 5% | Check provider API status; review error logs for patterns |
| p50 latency | Logs Insights (latency percentiles) | < 500 ms | Normal range varies by model; investigate sudden increases |
| p99 latency | Logs Insights (latency percentiles) | < 5000 ms | May indicate provider throttling or network issues |

Infrastructure metrics:

| Metric | Source | Healthy Range | Action if Breached |
| --- | --- | --- | --- |
| ECS running task count | CloudWatch ECS metrics | >= `autoscaling_min_capacity` | Check ECS events for task failures; verify health checks |
| CPU utilization | CloudWatch ECS metrics | < 70% (autoscaling target) | Autoscaling should handle; increase `autoscaling_max_capacity` if at the limit |
| ALB request count | CloudWatch ALB metrics | < 500/target (autoscaling target) | Autoscaling should handle; review whether tasks are scaling appropriately |
| ALB 5xx count | CloudWatch ALB metrics | 0 | Check ECS task health; review gateway logs for crashes |
| WAF blocked requests | CloudWatch WAF metrics | Low, non-zero | Review WAF logs for false positives; adjust rules if legitimate traffic is blocked |
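Nothing described above wires these thresholds to alarms automatically. As one illustration, the "ALB 5xx count = 0" row could be enforced with a CloudWatch alarm; the payload below is a hypothetical sketch suitable for `aws cloudwatch put-metric-alarm --cli-input-json` (the alarm name, SNS topic ARN, and `LoadBalancer` dimension value are placeholders):

```json
{
  "AlarmName": "ai-gateway-alb-5xx",
  "Namespace": "AWS/ApplicationELB",
  "MetricName": "HTTPCode_Target_5XX_Count",
  "Dimensions": [
    { "Name": "LoadBalancer", "Value": "app/PLACEHOLDER/PLACEHOLDER" }
  ],
  "Statistic": "Sum",
  "Period": 300,
  "EvaluationPeriods": 1,
  "Threshold": 0,
  "ComparisonOperator": "GreaterThanThreshold",
  "TreatMissingData": "notBreaching",
  "AlarmActions": ["arn:aws:sns:REGION:ACCOUNT_ID:PLACEHOLDER"]
}
```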
| Signal | First Check | Deep Dive |
| --- | --- | --- |
| High error rate | Dashboard “Error Rate by Provider” widget | Logs Insights: filter by status code and provider |
| Slow responses | Dashboard “Latency Percentiles” widget | X-Ray traces: look for slow spans |
| No traffic | ALB target group health in ECS console | ECS task events and container health checks |
| Task crashes | ECS service events tab | Gateway log group: look for fatal/error level logs |
| WAF blocking | WAF metrics in CloudWatch | WAF log group: `aws-waf-logs-ai-gateway-{env}` |