
The AI Gateway includes a set of B-series features that extend the base platform with additional capabilities. All B-series features are disabled by default and can be enabled independently through toggle variables in your Terraform configuration.

| Feature | Module | Toggle Variable | Status |
| --- | --- | --- | --- |
| B.1 Multi-Client Onboarding | `modules/clients/` | `enable_multi_client` | Opt-in |
| B.2 Provider Fallback Routing | Portkey config JSONs | `enable_fallback_routing` | Opt-in |
| B.3 Cost Attribution Pipeline | Lambda + DynamoDB | `enable_cost_attribution` | Opt-in |
| B.4 Bedrock Guardrails | `modules/guardrails/` | `enable_guardrails` | Opt-in |
| B.5 ElastiCache Response Cache | `modules/cache/` | `enable_response_cache` | Opt-in |

B.1 Multi-Client Onboarding provides per-team Cognito credentials, allowing you to issue separate client IDs and secrets to each consuming team or service. Each client can be assigned a subset of OAuth scopes, enabling fine-grained access control. Enabling it provisions:

  • Additional Cognito User Pool clients (one per team)
  • Per-client scope assignments (e.g., team A gets invoke only, team B gets invoke + admin)
  • Optional per-client rate limiting via WAF rules
```hcl
enable_multi_client = true

client_configurations = {
  team-alpha = {
    scopes = ["https://gateway.internal/invoke"]
  }
  team-beta = {
    scopes = ["https://gateway.internal/invoke", "https://gateway.internal/admin"]
  }
}
```

Each team receives its own client_id and client_secret from the Cognito User Pool. They use the standard client_credentials grant to obtain tokens scoped to their permissions. The ALB JWT listener validates the scope claim, ensuring teams can only access endpoints their scopes allow.
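The client_credentials exchange can be sketched as follows. Only request construction is shown; a real client would POST this to the `/oauth2/token` endpoint of your Cognito domain. The domain URL, client ID, and secret below are placeholders.

```python
import base64
import urllib.parse

def build_token_request(token_url, client_id, client_secret, scopes):
    """Build the URL, headers, and form body for an OAuth2
    client_credentials token request (HTTP Basic client auth)."""
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    headers = {
        "Authorization": f"Basic {basic}",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "scope": " ".join(scopes),
    })
    return token_url, headers, body

url, headers, body = build_token_request(
    "https://auth.example.com/oauth2/token",  # placeholder Cognito domain
    "team-alpha-client-id",                   # placeholder credentials
    "team-alpha-secret",
    ["https://gateway.internal/invoke"],
)
```

The returned access token carries the requested scopes in its `scope` claim, which the ALB JWT listener then validates.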


B.2 Provider Fallback Routing provides Portkey-native fallback and load-balancing configurations that route requests across multiple LLM providers. If the primary provider fails or is throttled, requests automatically fall back to a secondary provider.

| Strategy | Description | Use Case |
| --- | --- | --- |
| Fallback | Try providers in order; move to the next on failure | High availability: OpenAI primary, Bedrock fallback |
| Load Balance | Distribute requests across providers by weight | Cost optimization: 70% Bedrock, 30% OpenAI |
| Retry | Retry failed requests on the same or a different provider | Transient error recovery |
```hcl
enable_fallback_routing = true
```

This deploys Portkey routing configuration files that define fallback chains and load-balancing weights. The configurations are passed to the gateway as environment variables or mounted config files.

Fallback configuration:

```json
{
  "strategy": {
    "mode": "fallback"
  },
  "targets": [
    {
      "provider": "openai",
      "override_params": { "model": "gpt-4" }
    },
    {
      "provider": "bedrock",
      "override_params": { "model": "anthropic.claude-3-5-sonnet-20241022-v2:0" }
    }
  ]
}
```

Load-balance configuration:

```json
{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [
    {
      "provider": "bedrock",
      "weight": 0.7,
      "override_params": { "model": "anthropic.claude-3-5-sonnet-20241022-v2:0" }
    },
    {
      "provider": "openai",
      "weight": 0.3,
      "override_params": { "model": "gpt-4" }
    }
  ]
}
```
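The fallback strategy reduces to simple control flow: try each target in order and advance on failure. The sketch below uses stub callables standing in for the OpenAI and Bedrock providers; the ordering logic is the point, not the provider calls.

```python
def call_with_fallback(targets, request):
    """Try each (name, callable) target in order; return the first
    successful response, or raise if every provider fails."""
    errors = []
    for name, provider in targets:
        try:
            return name, provider(request)
        except Exception as exc:  # provider failure or throttle
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

# Stubs: the primary is throttled, the fallback succeeds.
def openai_stub(req):
    raise TimeoutError("429: rate limited")

def bedrock_stub(req):
    return {"completion": "ok", "model": "anthropic.claude-3-5-sonnet-20241022-v2:0"}

provider, response = call_with_fallback(
    [("openai", openai_stub), ("bedrock", bedrock_stub)],
    {"messages": [{"role": "user", "content": "hello"}]},
)
# provider == "bedrock": the request fell through to the second target
```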

B.3 Cost Attribution is a serverless pipeline that counts tokens, maps them to provider pricing, and publishes cost metrics to CloudWatch. This enables per-team and per-model cost visibility.

| Resource | Purpose |
| --- | --- |
| Lambda function | Parses gateway logs, counts prompt/completion tokens, calculates cost |
| DynamoDB pricing table | Stores per-model pricing rates (cost per 1K tokens) |
| CloudWatch Logs subscription | Streams gateway logs to the Lambda function |
| CloudWatch custom metrics | `AIGateway/TokensUsed` and `AIGateway/EstimatedCostUsd` |
| Dashboard widgets | Token usage and cost-by-provider widgets added to the main dashboard |
```hcl
enable_cost_attribution = true
```
  1. The gateway emits structured JSON logs for every request, including provider, model, and response metadata.
  2. A CloudWatch Logs subscription filter streams these logs to the Lambda function.
  3. The Lambda function extracts token counts from the response, looks up the per-model price in the DynamoDB pricing table, and calculates the estimated cost.
  4. Token counts and cost estimates are published as CloudWatch custom metrics under the AIGateway namespace.
  5. The dashboard displays token usage and cost breakdowns by provider and model.
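Steps 3 and 4 reduce to a small calculation. In this sketch a static dict stands in for the DynamoDB pricing table, and the rates are illustrative:

```python
# Illustrative per-model rates in USD per 1K tokens; the real values
# live in the DynamoDB pricing table.
PRICING = {
    ("bedrock", "anthropic.claude-3-5-sonnet-20241022-v2:0"): {
        "prompt": 0.003,
        "completion": 0.015,
    },
}

def estimate_cost(log_event):
    """Compute the estimated USD cost for one structured gateway log event."""
    rate = PRICING[(log_event["provider"], log_event["model"])]
    prompt_cost = log_event["prompt_tokens"] / 1000 * rate["prompt"]
    completion_cost = log_event["completion_tokens"] / 1000 * rate["completion"]
    return round(prompt_cost + completion_cost, 6)

cost = estimate_cost({
    "provider": "bedrock",
    "model": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "prompt_tokens": 1000,
    "completion_tokens": 500,
})
# 1.0 * 0.003 + 0.5 * 0.015 = 0.0105
```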

B.4 provides content safety controls powered by Amazon Bedrock Guardrails. These are applied to requests and responses passing through the gateway, blocking harmful content before it reaches end users.

| Resource | Purpose |
| --- | --- |
| Bedrock Guardrail | Content filtering, PII detection, topic policies, word policies |
| Guardrail version | Immutable published version for production use |
| IAM policy | Grants the ECS task role permission to invoke Bedrock Guardrails |

| Policy Type | Description | Default Behavior |
| --- | --- | --- |
| Content Filtering | Blocks harmful content categories (hate, violence, sexual, misconduct) | Block at HIGH strength for all categories |
| PII Blocking | Detects and blocks personally identifiable information | Blocks SSN, credit card, email, phone in responses |
| Topic Policies | Blocks requests about restricted topics | Configurable deny-list |
| Word Policies | Blocks specific words or patterns | Configurable word list |
```hcl
enable_guardrails = true

guardrail_config = {
  content_filter_strength = "HIGH"
  pii_action              = "BLOCK"
  blocked_topics          = ["financial-advice", "medical-diagnosis"]
  blocked_words           = []
}
```

When guardrails are enabled, the gateway invokes Bedrock’s ApplyGuardrail API on both the input (prompt) and output (completion). If either triggers a policy violation, the request is blocked with an explanatory error message. The guardrail evaluation adds latency to each request proportional to the content length.
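A sketch of the gate logic. The client is injected so a stub can stand in for the bedrock-runtime client here; the stub's SSN-pattern check merely imitates a PII policy hit, and the guardrail ID and version are placeholders:

```python
class GuardrailBlocked(Exception):
    pass

def check(client, guardrail_id, version, text, source):
    """Apply a Bedrock guardrail to one piece of content.
    `source` is "INPUT" for prompts, "OUTPUT" for completions."""
    resp = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=version,
        source=source,
        content=[{"text": {"text": text}}],
    )
    if resp.get("action") == "GUARDRAIL_INTERVENED":
        raise GuardrailBlocked(f"{source} blocked by guardrail policy")

# Stub client: flags anything containing an SSN-like pattern.
class StubClient:
    def apply_guardrail(self, **kwargs):
        text = kwargs["content"][0]["text"]["text"]
        hit = "123-45-6789" in text
        return {"action": "GUARDRAIL_INTERVENED" if hit else "NONE"}

client = StubClient()
check(client, "gr-example", "1", "What is the capital of France?", "INPUT")  # passes
try:
    check(client, "gr-example", "1", "My SSN is 123-45-6789", "INPUT")
    blocked = False
except GuardrailBlocked:
    blocked = True
```

In production the gateway would run `check` twice per request, once on the prompt and once on the completion, which is where the content-proportional latency comes from.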


B.5 provides a Redis-based response cache that stores LLM completions keyed by a hash of the request. Identical requests return cached responses, reducing latency and provider API costs.

| Resource | Purpose |
| --- | --- |
| ElastiCache Serverless (Redis 7.1) | Cache store with TLS encryption in transit |
| Security group | Allows port 6379 from ECS tasks only |
| Subnet group | Places Redis in private subnets |

| Setting | Value |
| --- | --- |
| Engine | Redis 7.1 |
| Encryption in transit | TLS enabled |
| Eviction policy | `allkeys-lru` (Least Recently Used) |
| Deployment | ElastiCache Serverless (auto-scaling) |
```hcl
enable_response_cache = true
```

When enabled, the gateway container receives additional environment variables:

| Variable | Value | Purpose |
| --- | --- | --- |
| `CACHE_STORE` | `redis` | Tells Portkey to use Redis for caching |
| `REDIS_URL` | `rediss://{endpoint}:6379` | TLS-encrypted Redis endpoint |
| `CACHE_TTL` | `3600` | Default cache TTL in seconds (1 hour) |

Portkey’s built-in caching layer hashes the request body (model, messages, parameters) to generate a cache key. On cache hit, the cached response is returned immediately without calling the LLM provider. On cache miss, the provider response is stored in Redis for subsequent requests.
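The hashing scheme can be sketched as follows. The key prefix and the in-memory dict standing in for Redis are illustrative; real code would issue `SETEX` with the configured TTL instead of a plain dict write.

```python
import hashlib
import json

def cache_key(body):
    """Derive a deterministic cache key from the request body
    (model, messages, parameters) via a canonical-JSON hash."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return "llm:" + hashlib.sha256(canonical.encode()).hexdigest()

cache = {}  # stands in for Redis

def cached_call(body, provider_call, ttl=3600):
    key = cache_key(body)
    if key in cache:
        return cache[key], True   # cache hit: no provider call
    response = provider_call(body)
    cache[key] = response         # real code: SETEX key ttl response
    return response, False        # cache miss

calls = []
def provider(body):
    calls.append(body)
    return {"completion": "Paris"}

body = {"model": "gpt-4", "messages": [{"role": "user", "content": "Capital of France?"}], "temperature": 0}
r1, hit1 = cached_call(body, provider)  # miss: provider invoked
r2, hit2 = cached_call(body, provider)  # hit: served from cache
```

Because the key covers model, messages, and parameters, changing any of them (e.g. `temperature`) produces a different key and bypasses the cached entry.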



The C-series features add metering, governance, and self-service capabilities on top of the B-series platform. The C-series API endpoints run on the Admin API Gateway plane (see ADR-014) and are enabled with `enable_admin_api = true`; C.4 and C.5 are controlled by their own toggles, as shown in the matrix.

| Feature | Module | Toggle Variable | Status |
| --- | --- | --- | --- |
| C.1 RPM & Token Rate Limiting | `rate_limiter/` | `enable_admin_api` | Opt-in |
| C.2 Usage Self-Service API | `usage_api/` | `enable_admin_api` | Opt-in |
| C.3 Dynamic Pricing Admin | `pricing_admin/` | `enable_admin_api` | Opt-in |
| C.4 Audit Log Pipeline | `modules/audit_log/` | `enable_audit_log` | Opt-in |
| C.5 Per-Team Cache Metrics | `cost_attribution/` | `enable_cost_attribution` | Opt-in |

C.1 provides per-team rate limiting with two dimensions: requests per minute (RPM) and daily token consumption. Limits are defined per tenant tier in `TierConfig` and enforced via DynamoDB atomic counters.

| Dimension | DynamoDB Key | Window | TTL |
| --- | --- | --- | --- |
| RPM | `RATE#RPM#{team}` / `MINUTE#{bucket}` | 1-minute sliding window | 120 seconds |
| Daily tokens | `RATE#TOKENS#{team}` / `DAY#{YYYY-MM-DD}` | Calendar day (UTC) | End of day + 1 hour |

Each request atomically increments the counter. When a limit is exceeded, the gateway returns a 429-equivalent response with a retry_after_seconds hint.

If DynamoDB is unreachable, the request is allowed and a warning is logged. This prevents rate limiting infrastructure from becoming a single point of failure on the inference path.
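The counter logic and fail-open behavior might look like the following sketch, with an in-memory store standing in for the DynamoDB table (a real implementation would use an atomic `UpdateItem` with an `ADD` expression):

```python
import time

class CounterStore:
    """In-memory stand-in for the DynamoDB atomic counter table."""
    def __init__(self):
        self.counts = {}
    def increment(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

def check_rpm(store, team, limit, now=None):
    """Count this request against the team's 1-minute bucket.
    Fails open: if the store is unreachable, the request is allowed."""
    now = time.time() if now is None else now
    bucket = int(now // 60)
    key = f"RATE#RPM#{team}/MINUTE#{bucket}"
    try:
        count = store.increment(key)
    except Exception:
        return {"allowed": True, "degraded": True}  # fail open, log a warning
    if count > limit:
        return {"allowed": False, "retry_after_seconds": 60 - int(now % 60)}
    return {"allowed": True}

store = CounterStore()
results = [check_rpm(store, "team-alpha", limit=3, now=120.0) for _ in range(4)]
# first three requests allowed; the fourth is rejected with a retry hint
```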

| Tier | RPM | Daily Tokens |
| --- | --- | --- |
| sandbox | 20 | 100,000 |
| standard | 100 | 1,000,000 |
| premium | 500 | 10,000,000 |
| enterprise | -1 (unlimited) | -1 (unlimited) |
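A minimal sketch of how the `-1` sentinel could be interpreted; the dict below mirrors the tier table, while the real structure lives in `TierConfig`:

```python
# Illustrative tier limits; -1 means the dimension is unlimited.
TIER_CONFIG = {
    "sandbox":    {"rpm": 20,  "daily_tokens": 100_000},
    "standard":   {"rpm": 100, "daily_tokens": 1_000_000},
    "premium":    {"rpm": 500, "daily_tokens": 10_000_000},
    "enterprise": {"rpm": -1,  "daily_tokens": -1},
}

def within_limit(used, limit):
    """True if another request or token spend is allowed;
    a limit of -1 disables enforcement for that dimension."""
    return limit == -1 or used < limit

sandbox_ok = within_limit(19, TIER_CONFIG["sandbox"]["rpm"])
sandbox_blocked = not within_limit(20, TIER_CONFIG["sandbox"]["rpm"])
enterprise_ok = within_limit(10**12, TIER_CONFIG["enterprise"]["daily_tokens"])
```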

C.2 is a read-only API that lets teams query their own usage without waiting for monthly chargeback reports.

| Method | Path | Description |
| --- | --- | --- |
| GET | `/usage/{team}` | Current period usage, budget utilization, and per-model breakdown |
| GET | `/usage/{team}/history` | Historical usage by month |

| Field | Description |
| --- | --- |
| `current_period` | Token counts, cost, and request count for the current billing period |
| `models` | Per-model breakdown (tokens, cost, request count) |
| `budget_utilization_pct` | Percentage of monthly budget consumed |
| `history` | Array of past periods with the same structure |

C.3 stores runtime pricing overrides in DynamoDB with a static fallback table, so operators can update model pricing without redeploying the Lambda.

| Method | Path | Description |
| --- | --- | --- |
| GET | `/pricing` | List all pricing entries (DynamoDB overrides merged with static defaults) |
| GET | `/pricing/{provider}/{model}` | Get pricing for a specific model |
| PUT | `/pricing/{provider}/{model}` | Create or update a pricing override |
| DELETE | `/pricing/{provider}/{model}` | Remove a DynamoDB override (reverts to static default) |
Price resolution order:

  1. DynamoDB override (if present)
  2. Static `PRICING_TABLE` in `pricing.py`

The source field in responses indicates whether a price came from "dynamodb" or "static".
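The resolution order reduces to a simple lookup. In this sketch a dict stands in for the DynamoDB table, and the static entry and override values are illustrative:

```python
# Static fallback table, as in pricing.py (rates here are illustrative).
PRICING_TABLE = {
    ("openai", "gpt-4"): {"prompt": 0.03, "completion": 0.06},
}

dynamo_overrides = {}  # stands in for the DynamoDB pricing table

def resolve_price(provider, model):
    """Return the effective price: a DynamoDB override wins; otherwise
    fall back to the static table. `source` mirrors the API response."""
    key = (provider, model)
    if key in dynamo_overrides:
        return {**dynamo_overrides[key], "source": "dynamodb"}
    return {**PRICING_TABLE[key], "source": "static"}

before = resolve_price("openai", "gpt-4")  # no override yet
dynamo_overrides[("openai", "gpt-4")] = {"prompt": 0.01, "completion": 0.03}
after = resolve_price("openai", "gpt-4")   # override now takes precedence
```

Deleting the override via the DELETE endpoint corresponds to removing the key, after which resolution reverts to the static entry.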


C.4 provides a structured audit trail for all gateway requests, stored as Parquet files in S3 for Athena queries.

| Resource | Purpose |
| --- | --- |
| Kinesis Firehose | Ingests audit events, buffers, and converts to Parquet |
| S3 bucket | Stores Parquet files with Hive-style partitioning (`year=/month=/day=`) |
| Glue Catalog | Database + table for Athena SQL queries |
| CloudWatch Log Group | Firehose delivery error logs |
```hcl
enable_audit_log = true
```
| Column | Type | Description |
| --- | --- | --- |
| `team` | string | Requesting team |
| `user_id` | string | User identity from JWT |
| `model` | string | Target model |
| `provider` | string | Target provider |
| `prompt_tokens` | int | Input token count |
| `completion_tokens` | int | Output token count |
| `total_tokens` | int | Total tokens |
| `cost_usd` | double | Estimated cost |
| `cache_read_tokens` | int | Tokens served from cache |
| `cache_savings_usd` | double | Cost saved by cache hits |
| `latency_ms` | int | End-to-end latency |
| `status` | string | Request outcome |
| `correlation_id` | string | Request correlation ID |
| `request_timestamp` | string | ISO 8601 timestamp |
The S3 bucket applies a lifecycle policy to audit data:

  • 0–90 days: S3 Standard
  • 90–365 days: S3 Standard-IA
  • 365+ days: Expired

C.5 extends the cost attribution pipeline (B.3) to publish cache hit/miss metrics with a Team dimension, in addition to the existing Provider and Model dimensions.

| Metric | Dimensions | Description |
| --- | --- | --- |
| `AIGateway/CacheHitRate` | Team, Provider, Model | Percentage of requests served from cache |
| `AIGateway/CacheSavingsUsd` | Team | Estimated cost savings from cache hits |

Use these metrics to identify teams with low cache hit rates and tune their request patterns (e.g., lowering temperature for deterministic calls).
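A sketch of how the metric payload might be assembled. Names mirror the table above; the helper function and its example values are illustrative, and the final `put_metric_data` call is shown only as a comment since it requires AWS credentials:

```python
def cache_metric_payload(team, provider, model, hits, total, savings_usd):
    """Build the MetricData list for CloudWatch put_metric_data."""
    hit_rate = 100.0 * hits / total if total else 0.0
    return [
        {
            "MetricName": "CacheHitRate",
            "Dimensions": [
                {"Name": "Team", "Value": team},
                {"Name": "Provider", "Value": provider},
                {"Name": "Model", "Value": model},
            ],
            "Value": hit_rate,
            "Unit": "Percent",
        },
        {
            "MetricName": "CacheSavingsUsd",
            "Dimensions": [{"Name": "Team", "Value": team}],
            "Value": savings_usd,
            "Unit": "None",
        },
    ]

data = cache_metric_payload("team-alpha", "bedrock", "claude-3-5-sonnet", 40, 100, 1.25)
# In the pipeline:
# boto3.client("cloudwatch").put_metric_data(Namespace="AIGateway", MetricData=data)
```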


All B-series features can be enabled independently. The following matrix shows which features complement each other and any dependencies:

| | B.1 Multi-Client | B.2 Fallback Routing | B.3 Cost Attribution | B.4 Guardrails | B.5 Cache |
| --- | --- | --- | --- | --- | --- |
| B.1 Multi-Client | – | Compatible | Compatible (per-client cost) | Compatible | Compatible |
| B.2 Fallback Routing | Compatible | – | Compatible (multi-provider cost) | Compatible | Compatible |
| B.3 Cost Attribution | Compatible (per-client cost) | Compatible (multi-provider cost) | – | Compatible | Compatible (tracks cache savings) |
| B.4 Guardrails | Compatible | Compatible | Compatible | – | Order-dependent (see note) |
| B.5 Cache | Compatible | Compatible | Compatible (tracks cache savings) | Order-dependent (see note) | – |
| Use Case | Features | Rationale |
| --- | --- | --- |
| Multi-team platform | B.1 + B.3 | Per-team credentials with per-team cost visibility |
| High-availability gateway | B.2 + B.5 | Fallback routing for resilience, caching for latency |
| Regulated workloads | B.1 + B.4 + B.3 | Access control, content safety, and cost tracking |
| Cost-optimized platform | B.2 + B.3 + B.5 | Load-balance across providers, track costs, cache responses |
| Full platform | B.1 + B.2 + B.3 + B.4 + B.5 | All features enabled for a complete enterprise deployment |