# Architecture
This page provides a complete mental model of the AI Gateway system: how requests flow from client agents through the infrastructure, how modules are organized, and why key design decisions were made.
## High-Level System Architecture
The gateway sits between AI coding agents and LLM providers, handling authentication, routing, and observability.
```mermaid
flowchart LR
    subgraph Clients
        A1[Claude Code]
        A2[OpenCode]
        A3[Goose / Continue /<br>LangChain / Codex]
    end
    subgraph AWS Cloud
        subgraph Public Subnets
            WAF[WAF v2<br>Rate Limiting +<br>AWS Managed Rules]
            ALB[ALB<br>TLS 1.3 +<br>JWT Validation]
        end
        subgraph Private Subnets
            GW[Portkey Gateway<br>Port 8787]
            OTEL[OTel Sidecar<br>Collector]
        end
        COG[Cognito<br>M2M Token Issuer]
        ECR[ECR<br>Container Registry]
        SM[Secrets Manager<br>Provider API Keys]
        CW[CloudWatch<br>Logs + Metrics]
        XRAY[X-Ray<br>Traces]
        KMS[KMS<br>Log Encryption]
    end
    subgraph Providers
        BED[AWS Bedrock]
        OAI[OpenAI]
        ANT[Anthropic]
        GOO[Google Vertex AI]
        AZR[Azure OpenAI]
    end
    A1 -->|/v1/messages| ALB
    A2 -->|/v1/chat/completions| ALB
    A3 -->|/v1/chat/completions| ALB
    WAF --- ALB
    ALB -->|JWT valid| GW
    COG -.->|JWKS| ALB
    ECR -.->|Image pull| GW
    SM -.->|API keys| GW
    GW --- OTEL
    OTEL --> CW
    OTEL --> XRAY
    KMS -.->|Encryption| CW
    GW --> BED
    GW --> OAI
    GW --> ANT
    GW --> GOO
    GW --> AZR
```
## Two-Plane Architecture
The gateway splits traffic into two planes (ADR-014):
| Plane | Transport | Auth | Endpoints | Traffic Pattern |
|---|---|---|---|---|
| Inference | ALB | ALB-native JWT | /v1/chat/completions, /v1/messages | High-volume, latency-sensitive |
| Admin | API Gateway REST API | Cognito Authorizer | /teams, /budgets, /routing, /scanner, /pricing, /usage | Low-volume, correctness-sensitive |
The ALB handles inference requests with zero per-request cost. Admin APIs run behind API Gateway with a shared Cognito authorizer, which eliminates per-handler JWT validation code and provides built-in throttling, access logging, and CloudWatch metrics.
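From the client side, the split boils down to which host and path a request targets. A minimal sketch, assuming hypothetical hostnames for each plane (the real DNS names depend on the deployment):

```python
# Sketch: addressing the two planes from a client.
# Hostnames below are placeholders, not the real deployment values.

INFERENCE_HOST = "https://gateway.example.com"   # ALB, JWT checked at the listener
ADMIN_HOST = "https://admin-api.example.com"     # API Gateway, Cognito authorizer

INFERENCE_PATHS = {"/v1/chat/completions", "/v1/messages"}
ADMIN_PATHS = {"/teams", "/budgets", "/routing", "/scanner", "/pricing", "/usage"}

def build_request(path: str, token: str) -> dict:
    """Return the URL and headers for a path, routed to the right plane."""
    if path in INFERENCE_PATHS:
        host = INFERENCE_HOST
    elif path in ADMIN_PATHS:
        host = ADMIN_HOST
    else:
        raise ValueError(f"unknown path: {path}")
    # Both planes expect a Cognito-issued bearer token; whether one set of
    # client credentials covers both depends on scope configuration.
    return {"url": host + path, "headers": {"Authorization": f"Bearer {token}"}}
```

Either way, the authorization header shape is identical; only the component that validates it differs.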
```mermaid
flowchart LR
    subgraph Clients
        Agent[AI Agent]
        Admin[Admin / Operator]
    end
    subgraph Inference Plane
        ALB[ALB<br>JWT Validation]
        GW[Portkey Gateway]
    end
    subgraph Admin Plane
        APIGW[API Gateway<br>Cognito Auth]
        L1[teams Lambda]
        L2[budgets Lambda]
        L3[routing Lambda]
        L4[scanner Lambda]
        L5[pricing Lambda]
        L6[usage Lambda]
    end
    Agent -->|/v1/messages| ALB --> GW
    Admin -->|/teams, /budgets, ...| APIGW
    APIGW --> L1
    APIGW --> L2
    APIGW --> L3
    APIGW --> L4
    APIGW --> L5
    APIGW --> L6
```
## Design Principles
**Lightweight** — The gateway adds minimal overhead. Portkey OSS is a ~62 MB container that proxies requests with sub-millisecond added latency. No database, no state, no complex middleware.

**Zero per-request auth cost** — ALB-native JWT validation means authentication adds no cost and no extra latency beyond the ALB itself. No API Gateway, no Lambda authorizer, no per-request charges. See ADR-005.

**Multi-provider** — A single gateway instance routes to Bedrock, OpenAI, Anthropic, Google Vertex AI, and Azure OpenAI through Portkey's 200+ model provider support.

**Dual-format API** — Both OpenAI Chat Completions (`/v1/chat/completions`) and Anthropic Messages (`/v1/messages`) are served natively on a single port, so every major coding agent works without translation layers. See ADR-006.

**Infrastructure as Code** — All resources are defined in Terraform with modular composition, environment-specific variable files, and automated documentation generation.
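The dual-format principle can be illustrated by the two request bodies the gateway accepts for the same prompt. A minimal sketch; the model identifiers are examples, not a statement of what is deployed:

```python
# Sketch: the same prompt in both wire formats the gateway serves natively.
# Model names are illustrative placeholders.

def openai_body(prompt: str) -> dict:
    """OpenAI Chat Completions format, POSTed to /v1/chat/completions."""
    return {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": prompt}],
    }

def anthropic_body(prompt: str) -> dict:
    """Anthropic Messages format, POSTed to /v1/messages."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,  # required by the Messages API
        "messages": [{"role": "user", "content": prompt}],
    }
```

Both bodies go to the same host and port; the path alone selects the format, which is why agents built against either API work unchanged.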
## Terraform Module Dependency Graph
The infrastructure is organized into six Terraform modules with explicit data dependencies. The root module (`infrastructure/main.tf`) wires the four core modules shown below together in order, with the `admin_api` and `audit_log` modules composed alongside them.
```mermaid
flowchart TD
    subgraph observability [Observability Module]
        O1[KMS Key<br>Log encryption]
        O2[CloudWatch Log Groups<br>Gateway + OTel]
        O3[Dashboard +<br>Saved Queries]
    end
    subgraph networking [Networking Module]
        N1[VPC<br>2 AZs, public + private]
        N2[ALB<br>TLS + Target Groups]
        N3[WAF v2<br>Managed Rules + Rate Limit]
        N4[VPC Endpoints<br>ECR, CW, SM, S3]
        N5[NAT Gateway<br>Single AZ]
    end
    subgraph auth [Auth Module]
        AU1[Cognito User Pool<br>M2M client_credentials]
        AU2[Resource Server<br>OAuth scopes]
        AU3[JWT Listener<br>validate_token action]
    end
    subgraph compute [Compute Module]
        C1[ECR Repository<br>Immutable tags, scan-on-push]
        C2[ECS Cluster + Service<br>Fargate]
        C3[Task Definition<br>Gateway + OTel sidecar]
        C4[IAM Roles<br>Execution + Task]
        C5[Secrets Manager<br>Provider API keys]
        C6[Auto Scaling<br>CPU + ALB requests]
    end
    O1 -->|logs_kms_key_arn| N3
    O2 -->|log_group_names| C3
    N2 -->|alb_arn| AU3
    N2 -->|target_group_arn| AU3
    N2 -->|target_group_arn| C2
    N2 -->|security_group_id| C2
    N1 -->|private_subnets| C2
    N2 -->|arn_suffix| C6
```
### Module Responsibilities
| Module | Resources | Outputs |
|---|---|---|
| observability | KMS key, CloudWatch log groups (gateway, OTel), saved queries, dashboard | logs_kms_key_arn, gateway_log_group_name, otel_log_group_name |
| networking | VPC, subnets (2 public + 2 private), NAT Gateway, VPC endpoints, ALB, WAF | vpc_id, private_subnets, alb_arn, alb_dns_name, alb_security_group_id, alb_target_group_gateway_arn |
| auth | Cognito User Pool, resource server, M2M client, domain, JWT listener rule | cognito_user_pool_id, cognito_user_pool_arn, cognito_client_id, cognito_token_endpoint |
| compute | ECR, ECS cluster, ECS service, task definition (gateway + OTel sidecar), IAM roles, Secrets Manager entries, auto-scaling policies | ecs_cluster_name, ecs_service_name, ecr_repository_url |
| admin_api | API Gateway REST API, Cognito authorizer, per-path Lambda integrations (teams, budgets, routing, scanner, pricing, usage), CloudWatch access logging | api_url, api_execution_arn |
| audit_log | Kinesis Firehose (Parquet conversion), S3 bucket (Hive-partitioned), Glue catalog (database + table), IAM roles | s3_bucket_name, firehose_stream_name, glue_database_name |
### Why This Order
- Observability first — Creates the KMS key and log groups that other modules need before they can configure WAF logging or container log destinations.
- Networking second — Creates the VPC, subnets, and ALB. Needs the KMS key from observability for WAF log encryption.
- Auth third — Creates the Cognito resources and the JWT validation listener rule on the ALB. Needs the ALB ARN and target group from networking.
- Compute last — Creates the ECS cluster, service, and supporting resources. Needs private subnets and ALB from networking, and log group names from observability.
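The ordering above falls out of the output-to-input wiring in the root module. A simplified HCL sketch; module sources and variable names are illustrative and may not match the actual `infrastructure/main.tf`:

```hcl
# Illustrative root-module wiring; names are simplified placeholders.

module "observability" {
  source = "./modules/observability"
}

module "networking" {
  source           = "./modules/networking"
  logs_kms_key_arn = module.observability.logs_kms_key_arn # WAF log encryption
}

module "auth" {
  source           = "./modules/auth"
  alb_arn          = module.networking.alb_arn
  target_group_arn = module.networking.alb_target_group_gateway_arn
}

module "compute" {
  source                 = "./modules/compute"
  private_subnets        = module.networking.private_subnets
  target_group_arn       = module.networking.alb_target_group_gateway_arn
  gateway_log_group_name = module.observability.gateway_log_group_name
}
```

Because each reference is an explicit `module.<name>.<output>` expression, Terraform derives the apply order from the dependency graph itself; no `depends_on` is needed.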
## Request Flow
```mermaid
sequenceDiagram
    participant Agent as AI Agent
    participant ALB as ALB (TLS + WAF)
    participant JWT as ALB JWT Validator
    participant GW as Portkey Gateway
    participant Provider as Model Provider
    Agent->>ALB: POST /v1/chat/completions<br>Authorization: Bearer jwt-token
    ALB->>ALB: WAF rules check<br>(rate limit, managed rules)
    ALB->>JWT: Validate JWT
    JWT->>JWT: Verify signature (JWKS)<br>Check iss, exp, nbf, iat
    alt Token invalid
        JWT-->>Agent: 401 Unauthorized
    end
    JWT->>GW: Forward request<br>(validated claims in headers)
    GW->>GW: Parse x-portkey-* headers<br>Resolve provider + model
    GW->>Provider: Proxy request<br>(with provider API key from Secrets Manager)
    Provider-->>GW: Model response
    GW-->>ALB: Response
    ALB-->>Agent: Response
```
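From an agent's perspective, provider resolution is driven entirely by headers. A hedged sketch of what a request through the ALB might carry — `x-portkey-provider` follows Portkey's `x-portkey-*` header convention, but the exact set of headers honored depends on the gateway configuration:

```python
# Sketch: headers an agent might send so the gateway can resolve the provider.
# The exact headers honored depend on the Portkey gateway's configuration.

def gateway_headers(jwt_token: str, provider: str) -> dict:
    """Headers for a request through the ALB to the Portkey gateway."""
    return {
        "Authorization": f"Bearer {jwt_token}",  # validated at the ALB, not the gateway
        "x-portkey-provider": provider,          # e.g. "bedrock", "openai", "anthropic"
        "Content-Type": "application/json",
    }
```

Note that the provider API key never appears in these headers; the gateway injects it from Secrets Manager when proxying upstream.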
## Authentication Flow
```mermaid
sequenceDiagram
    participant Client as AI Agent / Script
    participant Cognito as Cognito User Pool
    participant ALB as ALB
    participant GW as Portkey Gateway
    Client->>Cognito: POST /oauth2/token<br>grant_type=client_credentials<br>client_id + client_secret<br>scope=https://gateway.internal/invoke
    Cognito-->>Client: JWT access token (1h TTL)
    Client->>ALB: POST /v1/chat/completions<br>Authorization: Bearer jwt-token
    ALB->>ALB: validate_token action<br>Verify signature via JWKS<br>Check issuer, expiry, scope
    alt Token valid
        ALB->>GW: Forward to target group
        GW-->>ALB: Response
        ALB-->>Client: 200 OK + response body
    else Token invalid or expired
        ALB-->>Client: 401 Unauthorized
    end
```
The gateway uses Cognito machine-to-machine (M2M) authentication with the client_credentials OAuth 2.0 grant type. Key aspects:
- **Token issuance** — Cognito issues signed JWTs with a 1-hour TTL and the `https://gateway.internal/invoke` scope.
- **ALB validation** — The ALB's `validate_token` listener action validates JWT signatures against Cognito's JWKS endpoint, checking `iss`, `exp`, `nbf`, `iat`, and required scope claims. Invalid tokens receive a 401 directly from the ALB.
- **Zero cost** — JWT validation is included in the ALB at no additional charge. No API Gateway or Lambda authorizer is needed.
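The token-issuance step can be sketched as a standard OAuth 2.0 `client_credentials` POST. The endpoint URL and credentials below are placeholders, and the request is only assembled, never sent:

```python
import base64
import urllib.parse
import urllib.request

# Sketch: assembling the client_credentials token request for Cognito.
# TOKEN_URL and the credentials are placeholders for real deployment values.
TOKEN_URL = "https://example.auth.us-east-1.amazoncognito.com/oauth2/token"

def token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) the OAuth2 client_credentials token request."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "scope": "https://gateway.internal/invoke",
    }).encode()
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",  # HTTP Basic client authentication
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Sending this request would return a JSON body whose `access_token` field is the bearer token presented to the ALB for the following hour.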
## Network Architecture
The VPC follows a two-AZ layout optimized for cost:
- 2 public subnets — Host the Application Load Balancer.
- 2 private subnets — Host ECS Fargate tasks (Portkey gateway + OTel sidecar).
- 1 NAT Gateway — Handles outbound internet traffic for LLM provider API calls (non-Bedrock). Single AZ to reduce cost. See ADR-003.
- VPC Endpoints — ECR (API + DKR), CloudWatch Logs, Secrets Manager, and S3 (gateway endpoint). These eliminate NAT Gateway charges for AWS service traffic.
## Key Design Decisions
| Decision | Reference | Summary |
|---|---|---|
| Portkey OSS over LiteLLM | ADR-001 | LiteLLM has 14 CVEs including RCE; Portkey has zero CVEs and a ~62 MB image |
| ALB JWT over API Gateway | ADR-005 | Saves $260-2,400/month by validating JWTs at the ALB with zero additional latency |
| Dual API format | ADR-006 | Portkey natively serves both OpenAI and Anthropic formats on a single port |
| Single NAT + VPC endpoints | ADR-003 | Saves ~$32/month with acceptable HA trade-off for non-Bedrock outbound |
| 3-phase security pipeline | ADR-004 | Pre-build (hadolint + checkov), post-build (trivy + syft), post-scan (cosign) |
| AWS provider >= 6.22 | ADR-007 | Required for the validate_token (JWT validation) listener action on ALB |
| Two-plane architecture | ADR-014 | ALB for inference, API Gateway + Cognito for admin APIs — eliminates per-handler JWT code |