Feature Toggles
The AI Gateway includes a set of B-series features that extend the base platform with additional capabilities. All B-series features are disabled by default and can be enabled independently through toggle variables in your Terraform configuration.
Overview
Section titled “Overview”| Feature | Module | Toggle Variable | Status |
|---|---|---|---|
| B.1 Multi-Client Onboarding | modules/clients/ | enable_multi_client | Opt-in |
| B.2 Provider Fallback Routing | Portkey config JSONs | enable_fallback_routing | Opt-in |
| B.3 Cost Attribution Pipeline | Lambda + DynamoDB | enable_cost_attribution | Opt-in |
| B.4 Bedrock Guardrails | modules/guardrails/ | enable_guardrails | Opt-in |
| B.5 ElastiCache Response Cache | modules/cache/ | enable_response_cache | Opt-in |
B.1 Multi-Client Onboarding
Section titled “B.1 Multi-Client Onboarding”What It Adds
Section titled “What It Adds”Per-team Cognito credentials that allow you to issue separate client IDs and secrets to each consuming team or service. Each client can be assigned a subset of OAuth scopes, enabling fine-grained access control.
Resources Created
Section titled “Resources Created”- Additional Cognito User Pool clients (one per team)
- Per-client scope assignments (e.g., team A gets
invokeonly, team B getsinvoke+admin) - Optional per-client rate limiting via WAF rules
How to Enable
Section titled “How to Enable”enable_multi_client = true
client_configurations = { team-alpha = { scopes = ["https://gateway.internal/invoke"] } team-beta = { scopes = ["https://gateway.internal/invoke", "https://gateway.internal/admin"] }}How It Works
Section titled “How It Works”Each team receives its own client_id and client_secret from the Cognito User Pool. They use the standard client_credentials grant to obtain tokens scoped to their permissions. The ALB JWT listener validates the scope claim, ensuring teams can only access endpoints their scopes allow.
B.2 Provider Fallback Routing
Section titled “B.2 Provider Fallback Routing”What It Adds
Section titled “What It Adds”Portkey-native fallback and load-balancing configurations that route requests across multiple LLM providers. If the primary provider fails or is throttled, requests automatically fall back to a secondary provider.
Routing Strategies
Section titled “Routing Strategies”| Strategy | Description | Use Case |
|---|---|---|
| Fallback | Try providers in order; move to next on failure | High availability: OpenAI primary, Bedrock fallback |
| Load Balance | Distribute requests across providers by weight | Cost optimization: 70% Bedrock, 30% OpenAI |
| Retry | Retry failed requests on the same or different provider | Transient error recovery |
How to Enable
Section titled “How to Enable”enable_fallback_routing = trueThis deploys Portkey routing configuration files that define fallback chains and load-balancing weights. The configurations are passed to the gateway as environment variables or mounted config files.
Example Fallback Configuration
Section titled “Example Fallback Configuration”{ "strategy": { "mode": "fallback" }, "targets": [ { "provider": "openai", "override_params": { "model": "gpt-4" } }, { "provider": "bedrock", "override_params": { "model": "anthropic.claude-3-5-sonnet-20241022-v2:0" } } ]}Example Load-Balancing Configuration
Section titled “Example Load-Balancing Configuration”{ "strategy": { "mode": "loadbalance" }, "targets": [ { "provider": "bedrock", "weight": 0.7, "override_params": { "model": "anthropic.claude-3-5-sonnet-20241022-v2:0" } }, { "provider": "openai", "weight": 0.3, "override_params": { "model": "gpt-4" } } ]}B.3 Cost Attribution Pipeline
Section titled “B.3 Cost Attribution Pipeline”What It Adds
Section titled “What It Adds”A serverless pipeline that counts tokens, maps them to provider pricing, and publishes cost metrics to CloudWatch. This enables per-team and per-model cost visibility.
Resources Created
Section titled “Resources Created”| Resource | Purpose |
|---|---|
| Lambda function | Parses gateway logs, counts prompt/completion tokens, calculates cost |
| DynamoDB pricing table | Stores per-model pricing rates (cost per 1K tokens) |
| CloudWatch Logs subscription | Streams gateway logs to the Lambda function |
| CloudWatch custom metrics | AIGateway/TokensUsed and AIGateway/EstimatedCostUsd |
| Dashboard widgets | Token usage and cost-by-provider widgets added to the main dashboard |
How to Enable
Section titled “How to Enable”enable_cost_attribution = trueHow It Works
Section titled “How It Works”- The gateway emits structured JSON logs for every request, including provider, model, and response metadata.
- A CloudWatch Logs subscription filter streams these logs to the Lambda function.
- The Lambda function extracts token counts from the response, looks up the per-model price in the DynamoDB pricing table, and calculates the estimated cost.
- Token counts and cost estimates are published as CloudWatch custom metrics under the
AIGatewaynamespace. - The dashboard displays token usage and cost breakdowns by provider and model.
B.4 Bedrock Guardrails
Section titled “B.4 Bedrock Guardrails”What It Adds
Section titled “What It Adds”Content safety controls powered by Amazon Bedrock Guardrails. These are applied to requests and responses passing through the gateway, blocking harmful content before it reaches end users.
Resources Created
Section titled “Resources Created”| Resource | Purpose |
|---|---|
| Bedrock Guardrail | Content filtering, PII detection, topic policies, word policies |
| Guardrail version | Immutable published version for production use |
| IAM policy | Grants the ECS task role permission to invoke Bedrock Guardrails |
Guardrail Policies
Section titled “Guardrail Policies”| Policy Type | Description | Default Behavior |
|---|---|---|
| Content Filtering | Blocks harmful content categories (hate, violence, sexual, misconduct) | Block at HIGH strength for all categories |
| PII Blocking | Detects and blocks personally identifiable information | Blocks SSN, credit card, email, phone in responses |
| Topic Policies | Blocks requests about restricted topics | Configurable deny-list |
| Word Policies | Blocks specific words or patterns | Configurable word list |
How to Enable
Section titled “How to Enable”enable_guardrails = true
guardrail_config = { content_filter_strength = "HIGH" pii_action = "BLOCK" blocked_topics = ["financial-advice", "medical-diagnosis"] blocked_words = []}How It Works
Section titled “How It Works”When guardrails are enabled, the gateway invokes Bedrock’s ApplyGuardrail API on both the input (prompt) and output (completion). If either triggers a policy violation, the request is blocked with an explanatory error message. The guardrail evaluation adds latency to each request proportional to the content length.
B.5 ElastiCache Response Cache
Section titled “B.5 ElastiCache Response Cache”What It Adds
Section titled “What It Adds”A Redis-based response cache that stores LLM completions keyed by the request hash. Identical requests return cached responses, reducing latency and provider API costs.
Resources Created
Section titled “Resources Created”| Resource | Purpose |
|---|---|
| ElastiCache Serverless (Redis 7.1) | Cache store with TLS encryption in transit |
| Security group | Allows port 6379 from ECS tasks only |
| Subnet group | Places Redis in private subnets |
Configuration
Section titled “Configuration”| Setting | Value |
|---|---|
| Engine | Redis 7.1 |
| Encryption in transit | TLS enabled |
| Eviction policy | allkeys-lru (Least Recently Used) |
| Deployment | ElastiCache Serverless (auto-scaling) |
How to Enable
Section titled “How to Enable”enable_response_cache = trueWhen enabled, the gateway container receives additional environment variables:
| Variable | Value | Purpose |
|---|---|---|
CACHE_STORE | redis | Tells Portkey to use Redis for caching |
REDIS_URL | rediss://{endpoint}:6379 | TLS-encrypted Redis endpoint |
CACHE_TTL | 3600 | Default cache TTL in seconds (1 hour) |
How It Works
Section titled “How It Works”Portkey’s built-in caching layer hashes the request body (model, messages, parameters) to generate a cache key. On cache hit, the cached response is returned immediately without calling the LLM provider. On cache miss, the provider response is stored in Redis for subsequent requests.
C-Series Features
Section titled “C-Series Features”The C-series features add metering, governance, and self-service capabilities on top of the B-series platform. All C-series endpoints run on the Admin API Gateway plane (see ADR-014) and are enabled with enable_admin_api = true.
| Feature | Module | Toggle Variable | Status |
|---|---|---|---|
| C.1 RPM & Token Rate Limiting | rate_limiter/ | enable_admin_api | Opt-in |
| C.2 Usage Self-Service API | usage_api/ | enable_admin_api | Opt-in |
| C.3 Dynamic Pricing Admin | pricing_admin/ | enable_admin_api | Opt-in |
| C.4 Audit Log Pipeline | modules/audit_log/ | enable_audit_log | Opt-in |
| C.5 Per-Team Cache Metrics | cost_attribution/ | enable_cost_attribution | Opt-in |
C.1 RPM & Token Rate Limiting
Section titled “C.1 RPM & Token Rate Limiting”What It Adds
Section titled “What It Adds”Per-team rate limiting with two dimensions: requests per minute (RPM) and daily token consumption. Limits are defined per tenant tier in TierConfig and enforced via DynamoDB atomic counters.
How It Works
Section titled “How It Works”| Dimension | DynamoDB Key | Window | TTL |
|---|---|---|---|
| RPM | RATE#RPM#{team} / MINUTE#{bucket} | 1-minute sliding window | 120 seconds |
| Daily tokens | RATE#TOKENS#{team} / DAY#{YYYY-MM-DD} | Calendar day (UTC) | End of day + 1 hour |
Each request atomically increments the counter. When a limit is exceeded, the gateway returns a 429-equivalent response with a retry_after_seconds hint.
Graceful Degradation
Section titled “Graceful Degradation”If DynamoDB is unreachable, the request is allowed and a warning is logged. This prevents rate limiting infrastructure from becoming a single point of failure on the inference path.
Tier Defaults
Section titled “Tier Defaults”| Tier | RPM | Daily Tokens |
|---|---|---|
| sandbox | 20 | 100,000 |
| standard | 100 | 1,000,000 |
| premium | 500 | 10,000,000 |
| enterprise | -1 (unlimited) | -1 (unlimited) |
C.2 Usage Self-Service API
Section titled “C.2 Usage Self-Service API”What It Adds
Section titled “What It Adds”A read-only API that lets teams query their own usage without waiting for monthly chargeback reports.
Endpoints
Section titled “Endpoints”| Method | Path | Description |
|---|---|---|
GET | /usage/{team} | Current period usage, budget utilization, and per-model breakdown |
GET | /usage/{team}/history | Historical usage by month |
Response Fields
Section titled “Response Fields”| Field | Description |
|---|---|
current_period | Token counts, cost, and request count for the current billing period |
models | Per-model breakdown (tokens, cost, request count) |
budget_utilization_pct | Percentage of monthly budget consumed |
history | Array of past periods with the same structure |
C.3 Dynamic Pricing Admin
Section titled “C.3 Dynamic Pricing Admin”What It Adds
Section titled “What It Adds”Runtime pricing overrides stored in DynamoDB with a static fallback table. Operators can update model pricing without redeploying the Lambda.
Endpoints
Section titled “Endpoints”| Method | Path | Description |
|---|---|---|
GET | /pricing | List all pricing entries (DynamoDB overrides merged with static defaults) |
GET | /pricing/{provider}/{model} | Get pricing for a specific model |
PUT | /pricing/{provider}/{model} | Create or update a pricing override |
DELETE | /pricing/{provider}/{model} | Remove a DynamoDB override (reverts to static default) |
Resolution Order
Section titled “Resolution Order”- DynamoDB override (if present)
- Static
PRICING_TABLEinpricing.py
The source field in responses indicates whether a price came from "dynamodb" or "static".
C.4 Audit Log Pipeline
Section titled “C.4 Audit Log Pipeline”What It Adds
Section titled “What It Adds”A structured audit trail for all gateway requests, stored as Parquet files in S3 for Athena queries.
Resources Created
Section titled “Resources Created”| Resource | Purpose |
|---|---|
| Kinesis Firehose | Ingests audit events, buffers, and converts to Parquet |
| S3 bucket | Stores Parquet files with Hive-style partitioning (year=/month=/day=) |
| Glue Catalog | Database + table for Athena SQL queries |
| CloudWatch Log Group | Firehose delivery error logs |
How to Enable
Section titled “How to Enable”enable_audit_log = trueAudit Record Schema
Section titled “Audit Record Schema”| Column | Type | Description |
|---|---|---|
team | string | Requesting team |
user_id | string | User identity from JWT |
model | string | Target model |
provider | string | Target provider |
prompt_tokens | int | Input token count |
completion_tokens | int | Output token count |
total_tokens | int | Total tokens |
cost_usd | double | Estimated cost |
cache_read_tokens | int | Tokens served from cache |
cache_savings_usd | double | Cost saved by cache hits |
latency_ms | int | End-to-end latency |
status | string | Request outcome |
correlation_id | string | Request correlation ID |
request_timestamp | string | ISO 8601 timestamp |
Lifecycle
Section titled “Lifecycle”- 0–90 days: S3 Standard
- 90–365 days: S3 Standard-IA
- 365+ days: Expired
C.5 Per-Team Cache Metrics
Section titled “C.5 Per-Team Cache Metrics”What It Adds
Section titled “What It Adds”Extends the cost attribution pipeline (B.3) to publish cache hit/miss metrics with a Team dimension, in addition to the existing Provider and Model dimensions.
New CloudWatch Metrics
Section titled “New CloudWatch Metrics”| Metric | Dimensions | Description |
|---|---|---|
AIGateway/CacheHitRate | Team, Provider, Model | Percentage of requests served from cache |
AIGateway/CacheSavingsUsd | Team | Estimated cost savings from cache hits |
Use these metrics to identify teams with low cache hit rates and tune their request patterns (e.g., lowering temperature for deterministic calls).
Feature Compatibility Matrix
Section titled “Feature Compatibility Matrix”All B-series features can be enabled independently. The following matrix shows which features complement each other and any dependencies:
| B.1 Multi-Client | B.2 Fallback Routing | B.3 Cost Attribution | B.4 Guardrails | B.5 Cache | |
|---|---|---|---|---|---|
| B.1 Multi-Client | — | Compatible | Compatible (per-client cost) | Compatible | Compatible |
| B.2 Fallback Routing | Compatible | — | Compatible (multi-provider cost) | Compatible | Compatible |
| B.3 Cost Attribution | Compatible (per-client cost) | Compatible (multi-provider cost) | — | Compatible | Compatible (tracks cache savings) |
| B.4 Guardrails | Compatible | Compatible | Compatible | — | Order-dependent (see note) |
| B.5 Cache | Compatible | Compatible | Compatible (tracks cache savings) | Order-dependent (see note) | — |
Recommended Combinations
Section titled “Recommended Combinations”| Use Case | Features | Rationale |
|---|---|---|
| Multi-team platform | B.1 + B.3 | Per-team credentials with per-team cost visibility |
| High-availability gateway | B.2 + B.5 | Fallback routing for resilience, caching for latency |
| Regulated workloads | B.1 + B.4 + B.3 | Access control, content safety, and cost tracking |
| Cost-optimized platform | B.2 + B.3 + B.5 | Load-balance across providers, track costs, cache responses |
| Full platform | B.1 + B.2 + B.3 + B.4 + B.5 | All features enabled for a complete enterprise deployment |