
Feature Toggles

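The AI Gateway includes optional features that extend the base platform. Most are disabled by default and can be enabled independently through variables in your Terraform configuration. Two are controlled by the gateway config rather than a toggle: provider failover is always on, and prompt caching is opt-in through the rendered config.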

| Feature | Toggle Variable | Category |
|---|---|---|
| Multi-Client Onboarding | client_configs (non-empty map) | Access Control |
| Provider Fallback Routing | (gateway config) + enable_routing_api | Routing |
| Cost Attribution Pipeline | enable_cost_attribution | Cost Management |
| Bedrock Guardrails | enable_guardrails | Content Safety |
| Provider-Native Prompt Caching | (gateway config) | Performance |
| RPM & Token Rate Limiting | enable_admin_api | Metering |
| Usage Self-Service API | enable_admin_api | Metering |
| Dynamic Pricing Admin | enable_admin_api | Metering |
| Audit Log Pipeline | enable_audit_log | Compliance |
| Identity Provider Federation | enable_user_auth | Identity & SSO |
| Pre-Token Group Mapping | enable_user_auth | Identity & SSO |

Multi-Client Onboarding

Multi-client onboarding issues per-team Cognito credentials, so each consuming team or service gets its own client ID and secret. Each client can be assigned a subset of OAuth scopes, enabling fine-grained access control.

There is no boolean toggle for this feature. The clients Terraform module is driven entirely by the client_configs map variable: one entry per team, each provisioning a dedicated Cognito app client. The module is created only when client_configs is non-empty (length(var.client_configs) > 0); leaving it at the default empty map ({}) means no per-team clients are created.

Resources created (one set per client_configs entry):

  • A dedicated Cognito User Pool app client (aws_cognito_user_pool_client) named <project_name>-<team>-<environment>, with a generated client secret
  • Per-client scope assignments drawn from each entry’s allowed_scopes (e.g., team A gets invoke only, team B gets invoke + admin)
  • A 1-hour client_credentials access-token validity

How to enable:

Set the client_configs map. Each key is a team identifier; each value is an object with allowed_scopes (a list of OAuth scope identifiers) and a human-readable description:

client_configs = {
  platform = {
    allowed_scopes = ["https://gateway.internal/invoke"]
    description    = "Platform engineering team"
  }
  ml-ops = {
    allowed_scopes = ["https://gateway.internal/invoke", "https://gateway.internal/admin"]
    description    = "ML Operations team"
  }
}

Each team receives its own client_id and client_secret from the Cognito User Pool (exposed via the client_ids and client_secrets Terraform outputs, keyed by team). They use the standard client_credentials grant to obtain tokens scoped to their permissions. The ALB JWT listener validates the scope claim, ensuring teams can only access endpoints their scopes allow.
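As an illustration of the client_credentials step, the following sketch builds (but does not send) a token request against the standard Cognito /oauth2/token endpoint. The domain prefix, client ID, and client secret are placeholder assumptions; real values come from your user pool domain and the client_ids and client_secrets outputs.

```python
import base64
import urllib.parse
import urllib.request
from typing import Sequence


def build_token_request(token_url: str, client_id: str, client_secret: str,
                        scopes: Sequence[str]) -> urllib.request.Request:
    """Build (but do not send) a client_credentials token request."""
    # The client ID and secret travel as HTTP Basic credentials.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "scope": " ".join(scopes),  # space-separated, per OAuth 2.0
    }).encode()
    return urllib.request.Request(
        token_url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


# Placeholder values (assumptions), not real credentials or a real domain.
req = build_token_request(
    "https://example-domain.auth.us-east-1.amazoncognito.com/oauth2/token",
    "example-client-id",
    "example-client-secret",
    ["https://gateway.internal/invoke"],
)
# urllib.request.urlopen(req) would send it; the JSON response carries access_token.
```

A team would then present the returned access token as a Bearer token on gateway requests. Requesting a scope that is not in the client's allowed_scopes should cause the token request itself to be rejected.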


Provider Fallback Routing

agentgateway routes across providers using priority-group failover, declared under ai.groups in the rendered config (compute/agentgateway-config.yaml.tftpl). Failover is always on; there is no enable/disable toggle for it. Each group is a list of providers: the gateway tries the first group, then falls through to the next on failure. The default config makes Bedrock the primary and Anthropic-direct the fallback. Bedrock uses ambient ECS task-role credentials (SigV4, no static key); the Anthropic fallback uses an API key from Secrets Manager.

The optional dynamic routing API (Lambda + DynamoDB, gated by enable_routing_api) lets operators author and version routing rules; the routing_config Lambda renders them into the agentgateway backend config.

How it works:

| Mechanism | Where | Behavior |
|---|---|---|
| Priority-group failover | ai.groups in the rendered config | Try each group in order; fall through to the next group when a provider errors |
| Model aliases | policies.ai.modelAliases | Rewrite a requested model id to a provider-specific id (e.g. map gpt-4* → a Bedrock Claude model) |

Example (excerpt of the rendered config):

ai:
  groups:
    - providers:
        - name: bedrock-primary
          provider:
            bedrock:
              model: anthropic.claude-sonnet-4-20250514-v1:0
              region: us-east-1
          policies:
            backendAuth:
              aws: {} # ambient ECS task-role SigV4
    - providers:
        - name: anthropic-fallback
          provider:
            anthropic:
              model: claude-sonnet-4-20250514
          policies:
            backendAuth:
              key: ${ANTHROPIC_API_KEY}
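The fall-through behavior of priority groups can be sketched in a few lines of Python. This is a conceptual illustration, not agentgateway's implementation: the provider functions are hypothetical stand-ins, and trying providers within a group in list order is an assumption (the gateway may select among them differently).

```python
from typing import Callable, Optional, Sequence

Provider = Callable[[str], str]  # takes a prompt, returns a completion


def call_with_failover(groups: Sequence[Sequence[Provider]], prompt: str) -> str:
    """Try each priority group in order; fall through when its providers fail."""
    last_error: Optional[Exception] = None
    for group in groups:
        for provider in group:
            try:
                return provider(prompt)
            except Exception as err:  # a real gateway would only fail over on retryable errors
                last_error = err
    raise RuntimeError("all provider groups failed") from last_error


def bedrock_primary(prompt: str) -> str:
    # Hypothetical stand-in that simulates an outage of the primary provider.
    raise ConnectionError("simulated Bedrock outage")


def anthropic_fallback(prompt: str) -> str:
    # Hypothetical stand-in for the fallback provider.
    return f"fallback answer to: {prompt}"


result = call_with_failover([[bedrock_primary], [anthropic_fallback]], "hello")
```

Here the first group fails, so the call is answered by the second group; if every group fails, the caller sees a single error that chains the last provider failure.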

Cost Attribution Pipeline

A serverless pipeline counts tokens, maps them to provider pricing, and publishes cost metrics to CloudWatch, giving per-team and per-model cost visibility.

Resources created:

| Resource | Purpose |
|---|---|
| Lambda function | Parses gateway logs, counts prompt/completion tokens, calculates cost |
| DynamoDB pricing table | Stores per-model pricing rates (cost per 1K tokens) |
| CloudWatch Logs subscription | Streams gateway logs to the Lambda function |
| CloudWatch custom metrics | AIGateway/TokensUsed and AIGateway/EstimatedCostUsd |
| Dashboard widgets | Token usage and cost-by-provider widgets added to the main dashboard |

How to enable:

enable_cost_attribution = true

The gateway emits structured JSON logs for every request. A CloudWatch Logs subscription filter streams these logs to a Lambda function that extracts token counts, looks up per-model pricing in a DynamoDB table, calculates estimated cost, and publishes custom CloudWatch metrics under the AIGateway namespace.

The cost attribution Lambda reads agentgateway’s flat access log, including the cached_input_tokens and cache_creation_input_tokens fields emitted when provider-native prompt caching is active, so prompt-cache savings show up in per-team cost metrics.
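The core arithmetic of the pipeline (tokens divided by 1,000, multiplied by the per-1K rate) can be sketched as follows. The model name and rates are illustrative assumptions, not real prices, and the sketch ignores cached-token rates because their pricing is not specified here.

```python
from decimal import Decimal

# Hypothetical rates in USD per 1K tokens; real values live in the DynamoDB pricing table.
PRICING = {
    "example-model": {"input": Decimal("0.003"), "output": Decimal("0.015")},
}


def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Estimated cost = tokens / 1000 * per-1K rate, summed over input and output."""
    rates = PRICING[model]
    thousand = Decimal(1000)
    return (Decimal(prompt_tokens) / thousand * rates["input"]
            + Decimal(completion_tokens) / thousand * rates["output"])


cost = estimate_cost_usd("example-model", prompt_tokens=2000, completion_tokens=500)
```

Decimal is used instead of float so that small per-request amounts sum without binary rounding drift when aggregated per team.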


Bedrock Guardrails

Content safety is provided by Amazon Bedrock Guardrails. agentgateway calls the Bedrock ApplyGuardrail API inline, in the request path, on both the input (prompt) and the output (completion), signing each call with the ECS task role. There is no scanner Lambda and no separate scanner route; the guardrails Terraform module provisions the guardrail resource the gateway invokes.

Resources created:

| Resource | Purpose |
|---|---|
| Bedrock Guardrail | Content filters, sensitive-information (PII) policy, topic policies, word policies |
| Guardrail version | Immutable published version for production use |
| IAM policy | Grants the ECS task role permission to call ApplyGuardrail |

Detect-only by default. When enforce_guardrails = false (the default), every filter action is set to NONE: ApplyGuardrail still evaluates each request and returns assessments, but the gateway passes the request through untouched (log-only). Flip enforce_guardrails = true per environment to make filters BLOCK/ANONYMIZE and attach topic filters.

Guardrail policies:

| Policy Type | Description | Default Behavior (enforce_guardrails = false) |
|---|---|---|
| Content Filtering | Hate, violence, sexual, misconduct, prompt-attack categories | Evaluated at the configured strength, action NONE (detect/log only) |
| Sensitive Information (PII) | Detects PII entities (SSN, credit card, phone, email by default) | Detected, action NONE; set to BLOCK/ANONYMIZE when enforcing |
| Topic Policies | Restricted-topic deny-list | Attached only when enforcing |
| Word Policies | Specific words or phrases | Evaluated, action NONE until enforcing |

How to enable:

enable_guardrails = true
enforce_guardrails = false # detect/log-only; set true to BLOCK
content_filter_strength = "HIGH"
blocked_pii_types = ["SSN", "CREDIT_DEBIT_CARD_NUMBER", "PHONE", "EMAIL"]
blocked_topics = []
blocked_words = []

When the gateway’s bedrockGuardrails policy is wired (a non-empty bedrock_guardrail_id is rendered into the config), every request and response runs through ApplyGuardrail. With enforcement off, the call returns action=NONE and nothing is blocked; with enforcement on, a policy violation blocks the request with the configured message.


Provider-Native Prompt Caching

agentgateway has no response cache and no ElastiCache/Redis tier. Instead it relies on provider-native prompt caching, configured by the opt-in promptCaching policy in the rendered config. The policy injects Bedrock cachePoint markers into the system prompt, message history, and tool definitions, and applies only above a minimum token threshold.

This is not a response cache: every request still round-trips to the model and bills output tokens. What it saves is input-token cost on prefix reuse — a long shared system prompt or conversation prefix is billed at the cached (cheaper) rate on subsequent calls.

Scope and configuration (in the rendered config, opt-in):

policies:
  ai:
    promptCaching:
      cacheSystem: true
      cacheMessages: true
      cacheTools: true
      minTokens: 1024

| Aspect | Behavior |
|---|---|
| Opt-in | No cachePoint markers are added unless the promptCaching block is present |
| Bedrock path only | Markers are injected on the bedrock-primary provider; the Anthropic-fallback provider ignores this policy |
| Anthropic fallback | Caching there depends on the client sending cache_control, which agentgateway passes through |
| minTokens | Prefixes below the threshold are not marked, avoiding overhead on short prompts |

Prompt-cache token counts surface in the access log as cached_input_tokens / cache_creation_input_tokens and flow through to cost attribution. See ADR-017 (which supersedes the response-cache decision in ADR-012).
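To make the marker injection concrete, here is a rough sketch of what cacheSystem with a minTokens threshold might do to a Bedrock Converse-style request. The cachePoint block shape follows the Bedrock Converse API; the token estimate (about four characters per token) and the function names are assumptions for illustration, not the gateway's actual logic.

```python
import copy

CACHE_POINT = {"cachePoint": {"type": "default"}}


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); an assumption, not a real tokenizer.
    return len(text) // 4


def mark_system_prompt(request: dict, min_tokens: int = 1024) -> dict:
    """Return a copy of the request with a cachePoint appended after the system prompt."""
    marked = copy.deepcopy(request)
    system_blocks = marked.get("system", [])
    total = sum(estimate_tokens(block.get("text", "")) for block in system_blocks)
    # Only mark prefixes at or above the threshold; short prompts are left untouched.
    if system_blocks and total >= min_tokens:
        system_blocks.append(copy.deepcopy(CACHE_POINT))
    return marked


long_request = {"system": [{"text": "You are a helpful assistant. " * 200}]}
short_request = {"system": [{"text": "Be brief."}]}

marked_long = mark_system_prompt(long_request)    # gains a trailing cachePoint block
marked_short = mark_system_prompt(short_request)  # returned unchanged
```

Everything before the marker becomes the cacheable prefix, which is why the policy pays off mainly for long, stable system prompts and tool definitions.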


The three metering features below (RPM & Token Rate Limiting, Usage Self-Service API, and Dynamic Pricing Admin) run on the Admin API plane (see ADR-014) and are enabled with enable_admin_api = true.

RPM & Token Rate Limiting

Rate limiting is applied per team along two dimensions: requests per minute (RPM) and daily token consumption. Limits are defined per tenant tier and enforced via DynamoDB atomic counters.

| Dimension | DynamoDB Key | Window | TTL |
|---|---|---|---|
| RPM | RATE#RPM#{team} / MINUTE#{bucket} | 1-minute sliding window | 120 seconds |
| Daily tokens | RATE#TOKENS#{team} / DAY#{YYYY-MM-DD} | Calendar day (UTC) | End of day + 1 hour |

Each request atomically increments the counter. When a limit is exceeded, the gateway returns a 429-equivalent response with a retry_after_seconds hint.

Tier defaults:

| Tier | RPM | Daily Tokens |
|---|---|---|
| sandbox | 20 | 100,000 |
| standard | 100 | 1,000,000 |
| premium | 500 | 10,000,000 |
| enterprise | -1 (unlimited) | -1 (unlimited) |

If DynamoDB is unreachable, the request is allowed and a warning is logged. Rate limiting never blocks requests due to infrastructure failures.
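The RPM check, including the fail-open rule, might look roughly like the sketch below. An in-memory class stands in for DynamoDB's atomic ADD update, the bucket is assumed to be the epoch minute, and a fixed per-minute bucket is used as a simple approximation of the window; none of these details are confirmed by the implementation.

```python
import time
from typing import Dict, Optional, Tuple

TIER_RPM = {"sandbox": 20, "standard": 100, "premium": 500, "enterprise": -1}


class InMemoryCounters:
    """Stand-in for DynamoDB: increment() mimics an atomic ADD returning the new value."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], int] = {}

    def increment(self, pk: str, sk: str, ttl_epoch: int) -> int:
        # ttl_epoch would be stored as the item's TTL attribute; unused in this stand-in.
        self._items[(pk, sk)] = self._items.get((pk, sk), 0) + 1
        return self._items[(pk, sk)]


def check_rpm(store, team: str, tier: str, now: Optional[float] = None) -> dict:
    limit = TIER_RPM[tier]
    if limit < 0:  # -1 means unlimited
        return {"allowed": True}
    now = time.time() if now is None else now
    bucket = int(now // 60)  # assumption: one counter item per epoch minute
    try:
        count = store.increment(f"RATE#RPM#{team}", f"MINUTE#{bucket}", int(now) + 120)
    except Exception:
        # Fail open: an unreachable store must not block traffic (log a warning here).
        return {"allowed": True}
    if count > limit:
        retry_after = 60 - int(now % 60)  # seconds until the next bucket starts
        return {"allowed": False, "retry_after_seconds": retry_after}
    return {"allowed": True}


store = InMemoryCounters()
results = [check_rpm(store, "ml-ops", "sandbox", now=1_700_000_000) for _ in range(21)]
```

With the sandbox limit of 20, the first 20 calls in the same minute are allowed and the 21st is rejected with a retry hint. The daily token counter would follow the same pattern, incrementing by the request's token count instead of by one.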

Usage Self-Service API

A read-only API lets teams query their own usage without waiting for monthly chargeback reports.

| Method | Path | Description |
|---|---|---|
| GET | /usage/{team} | Current period usage, budget utilization, and per-model breakdown |
| GET | /usage/{team}/history | Historical usage by month |

Dynamic Pricing Admin

Runtime pricing overrides are stored in DynamoDB, with a static table as the fallback. Operators can update model pricing without redeploying the Lambda.

| Method | Path | Description |
|---|---|---|
| GET | /pricing | List all pricing entries (DynamoDB overrides merged with static defaults) |
| GET | /pricing/{provider}/{model} | Get pricing for a specific model |
| PUT | /pricing/{provider}/{model} | Create or update a pricing override |
| DELETE | /pricing/{provider}/{model} | Remove a DynamoDB override (reverts to static default) |

The source field in responses indicates whether a price came from "dynamodb" or "static".
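The override-then-fallback lookup can be sketched as a two-step dictionary check. The rate field names and values here are hypothetical; only the "dynamodb" and "static" source labels come from the API description above.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (provider, model)

# Hypothetical static defaults; rates are illustrative, not real prices.
STATIC_PRICING: Dict[Key, dict] = {
    ("bedrock", "example-model"): {"input_per_1k": 0.003, "output_per_1k": 0.015},
}


def get_pricing(provider: str, model: str, overrides: Dict[Key, dict]) -> dict:
    """Prefer a DynamoDB override; otherwise fall back to the static table."""
    key = (provider, model)
    if key in overrides:
        return {**overrides[key], "source": "dynamodb"}
    if key in STATIC_PRICING:
        return {**STATIC_PRICING[key], "source": "static"}
    raise KeyError(f"no pricing for {provider}/{model}")


# The overrides dict stands in for rows read from the DynamoDB table.
overrides = {("bedrock", "example-model"): {"input_per_1k": 0.002, "output_per_1k": 0.01}}
overridden = get_pricing("bedrock", "example-model", overrides)
default = get_pricing("bedrock", "example-model", {})
```

Deleting the override (the DELETE route) corresponds to removing the key from the overrides, after which the same lookup reports the static price again.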


Audit Log Pipeline

The audit log is a structured trail of all gateway requests, stored as Parquet files in S3 and queryable with Athena.

Resources created:

| Resource | Purpose |
|---|---|
| Kinesis Firehose | Ingests audit events, buffers, and converts to Parquet |
| S3 bucket | Stores Parquet files with Hive-style partitioning (year=/month=/day=) |
| Glue Catalog | Database + table for Athena SQL queries |
| CloudWatch Log Group | Firehose delivery error logs |

How to enable:

enable_audit_log = true

Audit record schema:

| Column | Type | Description |
|---|---|---|
| team | string | Requesting team |
| user_id | string | User identity from JWT |
| model | string | Target model |
| provider | string | Target provider |
| prompt_tokens | int | Input token count |
| completion_tokens | int | Output token count |
| total_tokens | int | Total tokens |
| cost_usd | double | Estimated cost |
| cache_read_tokens | int | Tokens served from cache |
| cache_savings_usd | double | Cost saved by cache hits |
| latency_ms | int | End-to-end latency |
| status | string | Request outcome |
| correlation_id | string | Request correlation ID |
| request_timestamp | string | ISO 8601 timestamp |

Lifecycle: 0-90 days S3 Standard, 90-365 days S3 Standard-IA, 365+ days expired.
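When locating a record by hand, it can help to see how a request_timestamp maps onto the year=/month=/day= layout. The sketch below assumes partitions are derived from the timestamp in UTC with zero-padded values; the actual prefix is produced by the Firehose configuration and may differ.

```python
from datetime import datetime, timezone


def partition_prefix(request_timestamp: str) -> str:
    """Map an ISO 8601 timestamp to a Hive-style year=/month=/day= prefix (UTC)."""
    # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalize it first.
    ts = datetime.fromisoformat(request_timestamp.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    ts = ts.astimezone(timezone.utc)
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


prefix = partition_prefix("2025-03-07T23:59:59Z")
```

Filtering Athena queries on the year, month, and day partition columns limits the scan to the matching prefixes rather than the whole bucket.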


Identity Provider Federation

The gateway federates with external identity providers (AWS Identity Center, Okta, Entra ID, or any SAML 2.0 / OIDC-compliant IdP) through the existing Cognito User Pool. Users authenticate with their corporate credentials via the Cognito Hosted UI and receive JWT tokens for gateway access.

Resources created:

| Resource | Purpose |
|---|---|
| aws_cognito_identity_provider | One per entry in identity_providers (SAML or OIDC) |
| Cognito app client (user_sso) | Public client for authorization_code flow with PKCE |
| Cognito Hosted UI domain | Login page served by Cognito |
| Pre-Token-Generation V2 Lambda | Maps IdP groups to custom gateway claims |

How to enable:

enable_user_auth = true
identity_providers = {
  IdentityCenter = {
    provider_type     = "SAML"
    metadata_url      = "https://portal.sso.us-east-1.amazonaws.com/saml/metadata/..."
    provider_details  = {}
    attribute_mapping = {}
  }
}
callback_urls = ["https://gateway.example.com/callback"]
logout_urls = ["https://gateway.example.com/logout"]

The user authenticates via the Cognito Hosted UI, which redirects to the configured IdP. After authentication, Cognito issues an authorization_code that the application exchanges for JWT tokens using PKCE. The ALB validates these tokens the same way it validates M2M tokens.

You can federate with multiple IdPs simultaneously by adding entries to the identity_providers map.
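For the PKCE part of the authorization_code flow, a client generates a random code verifier and sends its SHA-256 challenge with the authorize request. The sketch below follows RFC 7636 (S256 method); how your application stores the verifier between the redirect and the token exchange is left out.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    # 32 random bytes encode to 43 URL-safe characters, the minimum RFC 7636 allows.
    return secrets.token_urlsafe(32)


def code_challenge_s256(verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(verifier)) without '=' padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
challenge = code_challenge_s256(verifier)
# Send `challenge` with code_challenge_method=S256 on the authorize request,
# then send `verifier` as code_verifier when exchanging the authorization code.
```

Because the user_sso app client is public (no client secret), the verifier is what proves that the party redeeming the code is the one that started the login.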

Pre-Token Group Mapping

A Pre-Token-Generation V2 Lambda runs during Cognito token issuance and maps IdP group memberships to structured gateway claims. This enables per-team authorization, cost attribution, and tier-based rate limiting without manual user provisioning.

Custom claims injected:

| Claim | Purpose |
|---|---|
| custom:team | Team identifier for routing and cost attribution |
| custom:org_unit | Organizational unit |
| custom:cost_center | Cost center for billing attribution |
| custom:tenant_tier | Authorization tier (e.g., admin, standard, sandbox) |

How to enable:

group_mapping = {
  "aws-ai-gateway-admins" = {
    team        = "platform"
    org_unit    = "ai-engineering"
    cost_center = "CC-1234"
    tenant_tier = "admin"
  }
  "aws-ml-engineers" = {
    team        = "ml-eng"
    org_unit    = "ai-engineering"
    cost_center = "CC-5678"
    tenant_tier = "standard"
  }
}

After a user authenticates via their IdP, Cognito triggers the Pre-Token Lambda before issuing the JWT. The Lambda reads the user’s IdP groups, looks up the first matching entry in group_mapping, and injects the corresponding claims into the token.
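A handler along these lines could implement the mapping. It is a sketch under two assumptions that the deployed Lambda may not share: the IdP groups arrive as a comma-separated custom:groups user attribute, and "first matching" means the first of the user's groups that has a mapping entry. The response shape follows the Cognito pre token generation V2 trigger event.

```python
# Hypothetical copy of the mapping; in the deployment it comes from the group_mapping variable.
GROUP_MAPPING = {
    "aws-ai-gateway-admins": {"team": "platform", "org_unit": "ai-engineering",
                              "cost_center": "CC-1234", "tenant_tier": "admin"},
    "aws-ml-engineers": {"team": "ml-eng", "org_unit": "ai-engineering",
                         "cost_center": "CC-5678", "tenant_tier": "standard"},
}


def handler(event: dict, context: object = None) -> dict:
    # Assumption: the IdP's groups arrive as a comma-separated custom:groups attribute.
    attrs = event["request"].get("userAttributes", {})
    groups = [g.strip() for g in attrs.get("custom:groups", "").split(",") if g.strip()]

    # The first of the user's groups that has a mapping entry wins.
    match = next((GROUP_MAPPING[g] for g in groups if g in GROUP_MAPPING), None)
    if match is None:
        return event  # no mapped group: issue the token unchanged

    claims = {f"custom:{name}": value for name, value in match.items()}
    event.setdefault("response", {})["claimsAndScopeOverrideDetails"] = {
        "idTokenGeneration": {"claimsToAddOrOverride": claims},
        "accessTokenGeneration": {"claimsToAddOrOverride": claims},
    }
    return event


sample_event = {
    "request": {"userAttributes": {"custom:groups": "other-group, aws-ml-engineers"}},
    "response": {},
}
result = handler(sample_event)
```

For the sample user, the unmapped group is skipped and the aws-ml-engineers entry supplies the claims, so the issued tokens carry custom:team = ml-eng and custom:tenant_tier = standard.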

User SSO and M2M authentication share the same Cognito User Pool and ALB JWT validation. Both flows produce JWTs that the ALB validates against the same JWKS endpoint.

| Aspect | M2M | User SSO |
|---|---|---|
| OAuth grant | client_credentials | authorization_code with PKCE |
| Credentials | Client ID + secret | Corporate IdP credentials |
| Token contains | Scopes (invoke, admin) | Scopes + custom claims (team, tier) |
| Use case | Service-to-service automation | Developer portals, dashboards, CLI tools |

All features can be enabled independently. The following matrix shows interactions for the platform features:

| | Multi-Client | Fallback Routing | Cost Attribution | Guardrails | Prompt Caching |
|---|---|---|---|---|---|
| Multi-Client | - | Compatible | Compatible (per-client cost) | Compatible | Compatible |
| Fallback Routing | Compatible | - | Compatible (multi-provider cost) | Compatible | Compatible |
| Cost Attribution | Compatible (per-client cost) | Compatible (multi-provider cost) | - | Compatible | Compatible (tracks cache-token savings) |
| Guardrails | Compatible | Compatible | Compatible | - | Compatible |
| Prompt Caching | Compatible | Compatible (Bedrock path only) | Compatible (tracks cache-token savings) | Compatible | - |

| Use Case | Features | Rationale |
|---|---|---|
| Multi-team platform | Multi-Client + Cost Attribution | Per-team credentials with per-team cost visibility |
| High-availability gateway | Fallback Routing + Prompt Caching | Priority-group failover for resilience, prompt caching to cut input-token cost on prefix reuse |
| Regulated workloads | Multi-Client + Guardrails + Cost Attribution | Access control, content safety, and cost tracking |
| Cost-optimized platform | Fallback Routing + Cost Attribution + Prompt Caching | Fail over across providers, track costs, cut input-token cost on the Bedrock path |
| Full platform | All features enabled | Complete enterprise deployment |