Policy Engine (OPA)
End-to-end view of how policy decisions — model access, feature entitlements, data residency, tool approval, memory consent, RBAC — become a first-class platform primitive evaluated by a single engine. This page consolidates the Tenant Policy Engine PRD (v0.2, Draft) and its dependencies into a single reference. The design is complete; the 6-phase rollout has not started.
This page is derived from the Tenant Policy Engine PRD and its closely related dependencies:
Primary (docs/prd/):
- Tenant Policy Engine — the v0.2 design: OPA sidecar topology, dual-source bundles (CI-built Rego + Kafka-synced data), six policy domains, per-domain fail-open/closed posture, four-layer override merge, six-phase incremental migration
Dependencies (docs/prd/multi-tenancy/):
- 02 · Projects — establishes (tenant_id, project_id) as the policy boundary. Rego rules key into data.projects[tid][pid] for every evaluation
- 03 · Tenant Onboarding — InternalAdminGateway is the surface for UpdateTenantPolicyOverrides / UpdateProjectPolicyOverrides RPCs
- 07 · Data Residency — the regional model the data_residency policy enforces
Related:
- GDPR Readiness PRD — consent tracking the memory policy reads (input.user.consent.memory_extraction)
- AI Safety PRD (PR #1227) — content moderation rules; open question whether they live in OPA or a dedicated ML service
- User-Aware Generation PRD — user attribution surfaced into input.user
- Billing, Metering & Invoicing Roadmap — billing Phase 5 (real-time entitlement enforcement) blocks on this PRD's Phase 1 for opa-consumer and gateway middleware
- GCP Eventarc → Kafka Pipeline Roadmap — sibling pipeline; the data-bundle sync reuses the same Pub/Sub → Kafka bridge
- Tenant & Project Lifecycle Roadmap — multi-tenant substrate; tenant/project Firestore docs are the source of truth the data bundle projects from
- Edge Idempotency Roadmap — typed-subject format (user:*, tenant_member:*, agent:*) used in the policy input.user.id and subject namespacing
The whole design refuses to invent per-domain policy services or scatter if tenant.tier == "x" checks across the codebase:
- One engine for every policy domain. Model access, feature entitlements, data residency, tool execution, memory consent, RBAC — six domains, one OPA sidecar, one pkg/policy.Evaluate(path, input) call. A gateway request that needs RBAC + model access + feature access is one composable decision, not three Redis lookups.
- Sidecar, not shared service. Each Restate pod gets its own OPA on localhost:8181. Sub-millisecond evaluation, no network hop, no shared-service blast radius. The trade-off is ~50MB memory per pod — cheap relative to the latency saved.
- Dual-source bundles. Rego rules ship from git via CI to GCS (policies.tar.gz, polled every 30–60s). Tenant/project config ships from Firestore via Kafka to GCS (data.tar.gz, polled every 5–10s). Logic and data have different sources, different cadences, and different change owners — but compose at evaluation time via data.tenants[input.tenant_id].
- Input is request-shaped, data is platform-shaped. Services pass per-request context (tenant_id, project_id, user.role, action, resource) in input. Long-lived tenant tier, region, plan, and feature flags live in data.*. Per-request payload stays ~200 bytes; Rego has the full platform + tenant + project config in memory.
- Restrictions stack additively across four layers. Platform base → plan-tier overlay → tenant override → project override. Denylists union across layers; allowlists pick the most-specific non-empty layer; restriction-direction-only scalars (HIPAA, retention) ratchet upward. A higher layer cannot unblock what a lower layer denied.
- Entitlements vs release flags is a hard line. Long-lived tier-driven decisions ("Free can't access voice", "EU can't use US-hosted models") belong in OPA. Time-bounded toggles a PM flips from a UI (new streaming UI at 10% canary, kill switch) belong in Flipt via OpenFeature. Misclassifying the second as the first is what bloated the current scattered enforcement.
Canonical reference: Tenant Policy Engine PRD. The PRD's "What Moves to OPA vs What Stays in Code" section catalogs every current enforcement file with its target domain and migration phase.
Glossary
| Term | Definition |
|---|---|
| OPA | Open Policy Agent — CNCF-graduated policy engine. Evaluates Rego programs against JSON input + data, returns structured decisions. Self-hosted as a per-pod sidecar in this design. |
| Rego | OPA's declarative policy language. Used for all rule logic in policies/*.rego. Composable, testable via opa test, lint-checked in CI. |
| Policy bundle | A .tar.gz containing compiled Rego + optional data. OPA pulls bundles from a remote source on a polling interval and reloads on change. |
| Data bundle | A separate .tar.gz containing tenant/project/platform config as JSON. Loaded into data.tenants[*], data.projects[*][*], data.models.*, etc. — referenced by Rego rules at eval time. |
| pkg/policy | Greenfield Go client library (new module). Thin HTTP wrapper around the OPA REST API. Single method: Evaluate(ctx, path, input) → (*Decision, error). Same pattern as pkg/cache, pkg/secrets. |
| Decision | { allow: bool, reasons: []string, obligations?: ... } — Rego returns structured output so callers can attach human-readable deny reasons and domain-specific obligations (e.g., retention_days, log_level). |
| Policy domain | A Rego package corresponding to one enforcement concern: policy/model_access, policy/feature_access, policy/data_residency, policy/tool_execution, policy/memory, policy/rbac. |
| Shadow mode | Migration phase 1: call OPA alongside existing code, log both decisions, alert on disagreement, but use the existing code's verdict. Catches edge cases before swapping authority. |
| Bundle builder | services/policybundlebuilder/v1/ — new Restate service. Kafka consumer on platform.config.changes. Reads affected tenant/project docs from Firestore, writes updated JSON to GCS data bundle path, debounces bursts. |
| __platform__ project | Reserved synthetic project ID emitted by the bundle builder for every tenant. Permissive defaults. Callers without a real project (admin tools, cron, legacy endpoints) pass project_id="__platform__" so Rego doesn't need if project guards. InternalAdminGateway rejects user creation of this id. |
| Fail-open / fail-closed | Per-domain config for what happens when OPA is unreachable. Safety-critical domains (model_access, data_residency, tool_execution, rbac) fail-closed. Availability-sensitive domains (feature_access, memory search) fail-open. |
| Entitlement | Long-lived tenant-tier / compliance / plan / region-driven capability. Lives in OPA. Example: "EU tenants can only use models with EU DPAs." |
| Release flag | Time-bounded product/engineering toggle (canary, A/B, kill switch). Lives in Flipt via OpenFeature per PR #1270, not OPA. |
| Override layer | One of four merge layers: platform base (policies/platform/), plan-tier overlay (policies/overlays/{tier}/), tenant override (TenantPolicyOverrides on TenantProfileState), project override (ProjectPolicyOverrides on ProjectProfileState). |
Service & Component Inventory
New Services
| Component | Purpose | PRD section |
|---|---|---|
| pkg/policy/ (pkg/policy/client.go, go.mod) | Thin Go HTTP client wrapping OPA's /v1/data/{path} REST endpoint. One Evaluate() method that returns a structured Decision. Cross-cutting concern shipped as a shared module. | FR-1.3 |
| services/policybundlebuilder/v1/ (BundleBuilder) | Restate service + Kafka consumer on platform.config.changes. Reads Firestore tenant/project docs, writes updated JSON to GCS data-bundle path, debounces bursts (5s window), emits the synthetic __platform__ project for every tenant. | FR-5C |
| OPA sidecar container (in every Restate pod) | Open-source openpolicyagent/opa image. Configured with two bundle sources (policies + data), local decision log shipper, listens on localhost:8181 only. | FR-1.1, FR-1.2, FR-5B.1 |
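FR-1.3 pins the client to a single Evaluate method. A minimal sketch of what that wrapper could look like against OPA's standard POST /v1/data/{path} REST contract; the Decision fields follow the glossary entry above, while the constructor and error handling are illustrative rather than the shipped implementation:

package policy

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
)

// Decision mirrors the structured result every Rego package returns.
type Decision struct {
    Allow       bool           `json:"allow"`
    Reasons     []string       `json:"reasons,omitempty"`
    Obligations map[string]any `json:"obligations,omitempty"`
}

// Client is a thin wrapper around the local OPA sidecar's REST API.
type Client struct {
    baseURL string // e.g. "http://localhost:8181"
    httpc   *http.Client
}

func NewClient(baseURL string) *Client {
    return &Client{baseURL: baseURL, httpc: &http.Client{}}
}

// Evaluate POSTs {"input": ...} to /v1/data/{path} and decodes {"result": ...}.
func (c *Client) Evaluate(ctx context.Context, path string, input map[string]any) (*Decision, error) {
    body, err := json.Marshal(map[string]any{"input": input})
    if err != nil {
        return nil, err
    }
    req, err := http.NewRequestWithContext(ctx, http.MethodPost,
        fmt.Sprintf("%s/v1/data/%s", c.baseURL, path), bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := c.httpc.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("opa returned status %d", resp.StatusCode)
    }

    var out struct {
        Result Decision `json:"result"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return nil, err
    }
    return &out.Result, nil
}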
New Proto Definitions
| Definition | Location | Notes |
|---|---|---|
| TenantPolicyOverrides | apis/platform/v1/services/tenant/types.proto | Replaces the stub PolicyOverrides { map<string, string> overrides = 1; } at TenantProfileState field 16. The stub has no production readers (only generated MCP wrappers / defaults code), so the rename is safe. Fields: model_allowlist, model_denylist, blocked_mcp_servers, require_tool_approval_all, hipaa_mode, feature_overrides (map<string, bool>), memory_enabled, phi_retention_years. |
| ProjectPolicyOverrides | apis/platform/v1/services/project/types.proto (greenfield) | Attached to the future ProjectProfileState. Fields: allowed_models, disabled_features, require_tool_approval, require_classification, allowed_classifications, memory_enabled, custom_retention_days. Until ProjectProfileState lands alongside the ProjectProvisioningWorkflow, project overrides are not persisted — Rego sees only the synthetic __platform__ project. |
| InternalAdminGateway RPCs | apis/platform/v1/services/internal-admin-gateway/ | UpdateTenantPolicyOverrides / GetEffectiveTenantPolicy / UpdateProjectPolicyOverrides / GetEffectiveProjectPolicy. |
Integrated Components (No Changes Required)
| Component | Role |
|---|---|
| GCS | Hosts both bundle objects: gs://platform-opa-bundles/platform/policies.tar.gz and .../data.tar.gz. ETag-driven If-None-Match lets OPA only download when content changed. |
| Firestore | Source of truth for tenant/project config. Firestore change stream emits to the existing Pub/Sub topic the bundle builder consumes via Kafka. |
| Pub/Sub → Kafka bridge | Reuses the same pipeline as the GCP Eventarc → Kafka Pipeline. Tenant/project doc changes route to platform.config.changes. |
| Loki / OTLP Collector | OPA decision logs ship via stdout → Promtail/OTLP → Loki for short-term queryability, plus a Kafka audit topic for long-term retention. See FR-6. |
| OpenFGA | Stays. Owns relationship-based authorization ("is user U a member of org O?", "is user U the owner of resource R?"). OPA owns policy-based authorization ("can role R do action A?"). Both engines coexist — see Appendix B in the PRD. |
| Unkey | Stays. Per-API-key rate limiting is not a policy decision and remains in Unkey. |
| Lago | Stays. Quota / metering / current_usage enforcement is billing data, not OPA. Real-time entitlement state derived from Lago lands in OPA via the billing roadmap's opa-consumer (Phase 5 there = Phase 1 here). |
Observability Additions
| Signal | Purpose |
|---|---|
| OPA decision log per evaluation | decision_id, structured input, result, reasons, timer_rego_query_eval_ns — Loki-shipped, queryable by tenant / user / action / time |
| Shadow-mode disagreement alert | During migration phase 1, every divergence between OPA and existing code logs and alerts — surfaces edge cases before swapping authority |
| Bundle sync staleness metric | Age of last-applied policy / data bundle per pod — alerts if sync lag exceeds 60s (policies) or 30s (data) |
| Bundle builder Kafka consumer lag | Detects backpressure if Firestore change rate exceeds debounce throughput |
| decision_id propagation | Each deny response includes the decision id for support-ticket forensics; tied to the Loki entry |
Bundle Architecture — Logic and Data on Different Pipes
OPA bundles carry two distinct kinds of content with different sources and different update cadences. Conflating them is what makes OPA deployments brittle in practice; splitting them is the whole architectural move.
| Content type | Source | Update cadence | Example |
|---|---|---|---|
| Rego rules (logic) | Git repo (policies/) → CI build → GCS policies.tar.gz | On merge to main (typically minutes) | "PHI classification requires retention" |
| Data files (config) | Firestore → Pub/Sub → Kafka → bundle builder → GCS data.tar.gz | Real-time (seconds, debounced) | "mayo-clinic PHI retention = 10 years" |
Repo layout
policies/
├── platform/                        # Base platform policies (all tenants)
│   ├── model_access.rego
│   ├── feature_access.rego
│   ├── data_residency.rego
│   ├── tool_execution.rego
│   ├── memory.rego
│   ├── rbac.rego
│   └── data.json                    # Shared static data (feature → tier map, plan hierarchy)
├── overlays/                        # Per-plan-tier overrides
│   ├── free/feature_access.rego
│   ├── enterprise/model_access.rego
│   └── hipaa/tool_execution.rego
├── tenants/                         # Per-tenant custom policies (rare)
│   └── bigbank/model_access.rego
└── tests/                           # opa test fixtures, one per package
    ├── model_access_test.rego
    └── ...
CI build pipeline
opa test policies/ -v # gate every PR
opa build -b policies/platform -b policies/overlays/${PLAN_TIER} -o policies.tar.gz
gsutil cp policies.tar.gz gs://platform-opa-bundles/platform/policies.tar.gz
Kafka → OPA data sync
┌──────────────────────┐   ┌───────────┐   ┌───────────────┐   ┌───────────┐   ┌───────────────┐
│ InternalAdminGateway │──▶│ Firestore │──▶│ Firestore CDC │──▶│ Pub/Sub → │──▶│ BundleBuilder │
│ / ProjectWorkflow    │   │           │   │ change stream │   │ Kafka     │   │ (Restate svc) │
└──────────────────────┘   └───────────┘   └───────────────┘   └───────────┘   └───────┬───────┘
                                                                                       │
                                                                            Push to GCS data.tar.gz
                                                                                       │
                                                                                       ▼
                                                                               ┌───────────────┐
                                                                               │ OPA sidecars  │
                                                                               │ poll every 5s │
                                                                               └───────────────┘
End-to-end latency: Firestore write → OPA has new config in ~10–30 seconds.
Why not direct Firestore → OPA
- OPA doesn't natively read from Firestore (only HTTP bundles or push).
- Kafka adds durability — if the bundle builder is briefly down, events are not lost.
- Same Pub/Sub → Kafka bridge already exists for GCS and other event sources, so this is a new consumer, not a new pipeline.
- The bundle builder can debounce bursts (5s window) into one GCS write, avoiding write-storm during bulk tenant operations.
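The 5-second debounce window is what keeps bulk tenant operations from becoming a GCS write storm. A simplified sketch of the pattern, with hypothetical types and helpers (ConfigChange, rebuildAndUpload) standing in for the real bundle builder internals:

package bundlebuilder

import (
    "context"
    "time"
)

// ConfigChange is a hypothetical shape for one platform.config.changes event.
type ConfigChange struct {
    TenantID  string
    ProjectID string
}

// rebuildAndUpload is a placeholder for the real work: re-read the affected
// Firestore docs and rewrite the GCS data bundle.
func rebuildAndUpload(ctx context.Context, tenants map[string]struct{}) { /* ... */ }

// debounceLoop collapses bursts of config-change events into at most one
// bundle rebuild per window (5s in the PRD) instead of one write per event.
func debounceLoop(ctx context.Context, events <-chan ConfigChange) {
    const window = 5 * time.Second
    pending := map[string]struct{}{}
    ticker := time.NewTicker(window)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case ev := <-events:
            pending[ev.TenantID] = struct{}{} // coalesce, no write yet
        case <-ticker.C:
            if len(pending) > 0 {
                rebuildAndUpload(ctx, pending)
                pending = map[string]struct{}{}
            }
        }
    }
}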
Fallback behaviour
If the Kafka pipeline is down, OPA continues with the last-synced data bundle. Stale-but-safe by construction: policies only become more restrictive with stale data, never less (denylists union, allowlists pick most-specific non-empty, scalars ratchet toward restriction). A reconciliation job periodically rebuilds the full data bundle from Firestore as a consistency check.
__platform__ synthetic project
For every tenant, the bundle builder emits a synthetic project __platform__ with permissive defaults. This guarantees data.projects[tid][pid] always resolves and removes the need for if project guards in every Rego rule. Callers without a real project (admin paths, cron jobs, legacy endpoints) pass project_id="__platform__".
{
  "allowed_models": [],
  "disabled_features": [],
  "require_tool_approval": false,
  "require_classification": false,
  "allowed_classifications": [],
  "memory_enabled": true,
  "custom_retention_days": 0
}
The __platform__ project name is reserved — InternalAdminGateway rejects creation with this id.
Policy Domains
Six domains, one Rego package each, one enforcement-point family each. Every domain's failure mode and migration phase is locked in the PRD.
| Domain | Rego package | What it decides | Replaces | Fail-on-OPA-down | Phase |
|---|---|---|---|---|---|
| Model access | policy/model_access | "Can this tenant/user use this model?" Region restrictions, plan-tier model lists, project allowlists | services/openrouter/v1/metadata_filter.go (access portion; metadata filters for capabilities stay) | Closed (deny) | 2 |
| Feature access | policy/feature_access | "Does this plan tier / project allow this feature?" rag, voice, multi_agent, memory, webhooks, custom_models, … gated by plan hierarchy | Tier-driven hardcoded checks (NOT release flags — see scope note) | Open (allow) | 3 |
| Data residency | policy/data_residency | "Does this action keep data in the tenant's region?" Webhook endpoint URL region check; LLM provider EU-DPA requirement | New capability — enables Data Residency PRD | Closed (deny) | 2 |
| Tool execution | policy/tool_execution | "Can this tool run, and does it need human approval?" HIPAA mode forces approval on all; specific tools require approval for non-admins; untrusted MCP servers blocked | partitionToolCalls() pattern matching in workflows/generation/v1/impl.go | Closed (deny) | 5 |
| Memory & consent | policy/memory | "Can we extract memories from this conversation? Can we search the user's memories?" Requires user consent for extraction; tenant/project toggle for search | mem0Config.GetEnabled() check in workflows/generation/v1/memory_helpers.go | Open for search, closed for extraction | 5 |
| RBAC | policy/rbac | "Can this role perform this action?" Role → action mapping (admin/owner full, developer scoped, viewer read-only) | Hardcoded role→scope map in services/auth/v1/role_resolver.go + scope_checker.go | Closed (deny) | 4 |
Per-domain Rego semantics
Two patterns matter across all six packages:
Deny wins explicitly. In a package with default allow := true, a bare allow if { count(deny) == 0 } rule is undefined whenever its body fails, so OPA falls back to the default — silently allowing the request despite a pending deny. The PRD locks the correct shape into every default-true package:
default allow := true
deny contains msg if { ... }
allow := false if {
    count(deny) > 0
}
Missing keys behave correctly under == comparisons. A search_allowed := false if tenant.memory_enabled == false rule does NOT match when tenant.memory_enabled is absent — the comparison is undefined, the rule body fails, and the default search_allowed := true stays. This is why scalar overrides in the PRD use explicit == false rather than != true.
Decision-tuple shape
Every domain returns allow: bool + optional reasons: [string]. Some domains add domain-specific fields the caller applies directly:
{
  "result": {
    "allow": false,
    "reasons": [
      "Model 'anthropic/claude-sonnet-4' is not in the EU-approved model list for tenant 'bigbank'",
      "Tenant region 'eu' requires models with EU data processing agreements"
    ],
    "obligations": {
      "log_level": "warn",
      "notify_admin": false
    }
  }
}
Storage upload returns retention_days, classification, access_default; tool execution returns require_approval: bool. These are documented in each domain's consuming PRD (e.g., Storage Model PRD §8.5 for storage obligations).
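Obligations stay an open map on the Decision; each caller pulls out only the fields its domain documents. An illustrative sketch of how a storage-upload caller might read them, using a hypothetical helper name and the field names from the examples in this section:

// applyStorageObligations pulls the storage-domain obligations out of a
// Decision (hypothetical helper; the real consumer lives in the upload path).
// JSON numbers arrive as float64 when decoded into map[string]any.
func applyStorageObligations(d *policy.Decision) (retentionDays int, classification string) {
    if v, ok := d.Obligations["retention_days"].(float64); ok {
        retentionDays = int(v)
    }
    if v, ok := d.Obligations["classification"].(string); ok {
        classification = v
    }
    return retentionDays, classification
}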
Input vs Data — What Goes Where
Services pass request-specific context in input. Tenant/project config lives in the data bundle.
| Field | Source | Rationale |
|---|---|---|
| tenant_id, project_id | Input (per-request) | Identifies which data to look up |
| user.id, user.role, user.scopes | Input (per-request) | Changes per request |
| action, resource.* | Input (per-request) | What the user is trying to do |
| context.ip, context.timestamp | Input (per-request) | Audit / forensics |
| Tenant plan_tier, data_region, hipaa_mode, phi_retention_years | Data bundle | Changes rarely (admin action), shared across requests |
| Project allowed_classifications, custom_retention_days, require_tool_approval | Data bundle | Changes rarely (admin action), shared across requests |
| data.models.eu_approved, data.models.eu_dpa_providers, feature→tier map, plan hierarchy | Data bundle (platform-static) | Platform config, changes on deploy or admin action |
Per-request input stays ~200 bytes while OPA has full platform + tenant + project config in memory. project_id is always required — callers without a real project pass "__platform__".
Four-Layer Override Merge
Merge semantics by field shape
| Field shape | Examples | Merge rule |
|---|---|---|
| Denylists | model_denylist, blocked_mcp_servers, disabled_features | Additive across layers — every layer can add. Net deny = union(platform, tier, tenant, project). |
| Allowlists | model_allowlist, allowed_models, allowed_classifications | Most-specific non-empty wins — a non-empty layer replaces lower layers. If project.allowed_models is non-empty it is used; else tenant.model_allowlist; else platform/tier defaults. |
| Booleans / scalars | hipaa_mode, require_tool_approval, memory_enabled, phi_retention_years | Most-specific-wins, restriction-direction only. A higher layer can flip toward more restrictive (memory_enabled: false, require_tool_approval: true, higher retention) but cannot relax a lower layer's restriction. |
The single invariant
A higher layer can never unblock what a lower layer denied. A project allowlist of ["A"] does NOT grant access to model "B" if tenant.model_denylist contains "B". Allowlists narrow the available set; they do not override denylists. This invariant is what makes tenant self-service safe — tenants can ratchet down, never up.
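The merge itself lives in Rego and the bundle builder, but the semantics are compact enough to state as code. An illustrative Go sketch of the three rules over a reduced field set (names and types here are not from the PRD):

// Layer holds the override fields one layer contributes, ordered from least
// specific (platform base) to most specific (project override).
type Layer struct {
    ModelDenylist  []string
    ModelAllowlist []string
    MemoryEnabled  *bool // nil = not set at this layer
}

// Effective applies the three merge rules across ordered layers.
func Effective(layers []Layer) (denylist, allowlist []string, memoryEnabled bool) {
    seen := map[string]bool{}
    memoryEnabled = true // permissive default; layers may only ratchet to false

    for _, l := range layers {
        // Denylists union: every layer can add, none can remove.
        for _, m := range l.ModelDenylist {
            if !seen[m] {
                seen[m] = true
                denylist = append(denylist, m)
            }
        }
        // Allowlists: the most specific non-empty layer replaces lower ones.
        if len(l.ModelAllowlist) > 0 {
            allowlist = l.ModelAllowlist
        }
        // Scalars ratchet toward restriction only; a later layer cannot re-enable.
        if l.MemoryEnabled != nil && !*l.MemoryEnabled {
            memoryEnabled = false
        }
    }
    return denylist, allowlist, memoryEnabled
}

// Allowed states the invariant: a model must clear the effective allowlist
// (if any) AND must not appear on the effective denylist. A more specific
// allowlist narrows the set but never overrides a denylist entry.
func Allowed(model string, allowlist, denylist []string) bool {
    for _, d := range denylist {
        if d == model {
            return false
        }
    }
    if len(allowlist) == 0 {
        return true // no allowlist at any layer: other rules decide
    }
    for _, a := range allowlist {
        if a == model {
            return true
        }
    }
    return false
}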
Enforcement Integration
Call shape
Every enforcement point calls pkg/policy.Evaluate(). The standard call site:
projectID := reqCtx.GetProject()
if projectID == "" {
    projectID = "__platform__" // FR-5C.4
}

decision, err := s.policyClient.Evaluate(ctx, "policy/model_access", map[string]any{
    "tenant_id":  reqCtx.GetTenant(),
    "project_id": projectID,
    "user":       map[string]any{"id": reqCtx.GetSubject(), "role": userRole},
    "action":     "llm.generate",
    "resource": map[string]any{
        "model":    effectiveConfig.GetModel(),
        "provider": resolveProvider(effectiveConfig.GetModel()),
    },
})
if err != nil {
    // Per-domain fail-open / fail-closed posture applies here (see Production
    // Hardening). model_access is fail-closed, so an unreachable OPA denies.
    slog.Warn("policy evaluation failed", "error", err, "domain", "model_access")
    return nil, sdkgo.TerminalError(fmt.Errorf("policy evaluation unavailable"), 503)
}
if !decision.Allow {
    return nil, sdkgo.TerminalError(
        fmt.Errorf("policy denied: %s", strings.Join(decision.Reasons, "; ")),
        403,
    )
}
Incremental migration — three phases per enforcement point
Each enforcement point migrates independently. No big-bang cutover.
Phase 1 — Shadow mode (1 week per domain):
- Add OPA call alongside existing code
- Log both decisions, alert on disagreement
- Existing code remains authoritative
Phase 2 — OPA active:
- OPA decision becomes authoritative
- Existing code remains as dead-code fallback (feature-flagged)
- Monitor for regressions
Phase 3 — Cleanup:
- Remove old if/else/switch logic
- Remove feature flag
- OPA is sole decision maker
Shadow-mode disagreements are how edge cases land in the test corpus before they land in production denials.
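A sketch of what shadow mode looks like at one call site; legacyModelAccessCheck and the surrounding service type are hypothetical stand-ins for the existing enforcement code, which stays authoritative throughout phase 1:

// evaluateShadow runs OPA alongside the legacy check without changing the
// request outcome. The legacy verdict (passed in as legacyAllow) still decides
// the request; OPA results only feed logging and the disagreement alert.
func (s *Service) evaluateShadow(ctx context.Context, input map[string]any, legacyAllow bool) {
    decision, err := s.policyClient.Evaluate(ctx, "policy/model_access", input)
    if err != nil {
        // Shadow failures never affect the request; they only count against readiness.
        slog.Warn("shadow policy evaluation failed", "domain", "model_access", "error", err)
        return
    }
    if decision.Allow != legacyAllow {
        // Disagreements feed the alert and the test corpus before authority is swapped.
        slog.Error("shadow policy disagreement",
            "domain", "model_access",
            "legacy_allow", legacyAllow,
            "opa_allow", decision.Allow,
            "reasons", decision.Reasons)
    }
}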
What Moves to OPA vs What Stays in Code
The line: policy decisions vs business logic
| Move to OPA | Keep in code |
|---|---|
| "Can tenant X use model Y?" | "How do I call OpenRouter with model Y?" |
| "Does project P require file classification?" | "How do I write a Firestore doc?" |
| "Is this tool approved for this user?" | "How do I execute this tool?" |
| "What retention period applies?" | "How do I set GCS Object Retention?" |
| "Is this feature available on this plan?" | "How do I render the feature gate error?" |
| "Which models are allowed in EU?" | "How do I parse the OpenRouter response?" |
Rule of thumb. If the answer could differ between two tenants or two projects with the same code deployed, it's a policy decision → OPA. If it's the same regardless of who's calling, it's business logic → stays in code.
Migration map (file → domain → phase)
| Current enforcement | File | OPA domain | Migration phase |
|---|---|---|---|
| Model metadata filtering (access portion) | services/openrouter/v1/metadata_filter.go | model_access | Phase 2 |
| exclude_moderated filter | services/openrouter/v1/metadata_filter.go | model_access | Phase 2 |
| HTTPS-only webhook enforcement | services/webhook/v1/tenant_helpers.go | data_residency | Phase 2 |
| mem0Config.GetEnabled() (if it's entitlement, not release flag) | workflows/generation/v1/memory_helpers.go | feature_access | Phase 3 |
| Role → scope mapping | services/auth/v1/role_resolver.go | rbac | Phase 4 |
| Scope checking | services/auth/v1/scope_checker.go | rbac | Phase 4 |
| Tool approval pattern matching | workflows/generation/v1/impl.go (partitionToolCalls) | tool_execution | Phase 5 |
| Memory extraction gate | workflows/generation/v1/memory_helpers.go | memory | Phase 5 |
Stays in code (and why)
| Concern | Why it stays |
|---|---|
| Request validation (buf.validate annotations) | Schema enforcement, same for all tenants |
| Restate retry/timeout config | Infrastructure concern, not tenant-specific |
| OpenFGA relationship checks ("is user member of project?") | Relationship-based auth, not policy |
| Error code mapping (gRPC/HTTP status) | Deterministic transformation |
| Kafka topic routing | Infrastructure concern |
| Proto serialization / deserialization | Mechanical transformation |
| GCS signed URL generation | Implementation detail after policy allows the download |
| Quota enforcement (Lago current_usage check) | Metering/billing — Lago authoritative, not OPA |
| Rate limiting | Unkey per-key limits + KrakenD — different mechanism, different cadence |
Entitlements vs release flags — the hard line
Two distinct categories of "this tenant can/can't use feature X" decisions belong in different systems:
| Concern | System | Examples |
|---|---|---|
| Feature entitlements — long-lived rules driven by tenant tier, compliance regime, region, plan, or platform security policy | OPA (this PRD) | "Free tier can't access voice", "EU tenants can't use US-hosted models", "HIPAA-mode tenants must use HIPAA-approved models", "Enterprise tier gets longer retention" |
| Release feature flags — time-bounded toggles driven by product rollout, canary, A/B test, or kill switch | Flipt via OpenFeature (PR #1270) | "New streaming UI at 10% canary", "A/B test variant B for MCP panel", "kill switch on the buggy memory extractor", "beta tester allowlist for experimental model X" |
Rule of thumb. If a PM wants to flip it from a UI without writing code or opening a policy PR, it's a release flag → Flipt. If a platform/security engineer would write it as a Rego rule that stays stable for months or years, it's an entitlement → OPA.
A per-toggle audit of ad-hoc gates like mem0Config.GetEnabled() is required before migration — if it's actually a product rollout rather than a tier-gated entitlement, it belongs in Flipt, not in the feature_access Rego policy.
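The two call sites end up looking deliberately different. A hedged sketch of the contrast, assuming the OpenFeature Go SDK on the Flipt side; the flag key, feature name, and evaluation-context fields are illustrative:

// Entitlement: long-lived, tier/region/compliance driven -> OPA via pkg/policy.
decision, err := policyClient.Evaluate(ctx, "policy/feature_access", map[string]any{
    "tenant_id":  tenantID,
    "project_id": projectID,
    "action":     "feature.use",
    "resource":   map[string]any{"feature": "voice"},
})

// Release flag: time-bounded rollout/canary/kill switch -> Flipt via OpenFeature.
ofClient := openfeature.NewClient("generation")
newStreamingUI, _ := ofClient.BooleanValue(ctx, "new-streaming-ui", false,
    openfeature.NewEvaluationContext(userID, map[string]any{"tenant_id": tenantID}))

// decision.Allow gates the entitlement; newStreamingUI gates the rollout.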
Tenant Policy Configuration
Tenants and projects configure their overrides via InternalAdminGateway RPCs. Tenants never author raw Rego — they fill structured proto fields, which the bundle builder lifts into JSON the Rego rules read via data.tenants[tid] / data.projects[tid][pid].
// apis/platform/v1/services/tenant/types.proto
message TenantPolicyOverrides {
  repeated string model_allowlist = 1;
  repeated string model_denylist = 2;
  repeated string blocked_mcp_servers = 3;
  bool require_tool_approval_all = 4;
  bool hipaa_mode = 5;
  map<string, bool> feature_overrides = 6;
  bool memory_enabled = 7;
  int32 phi_retention_years = 8;
}

message ProjectPolicyOverrides {
  repeated string allowed_models = 1;
  repeated string disabled_features = 2;
  bool require_tool_approval = 3;
  bool require_classification = 4;
  repeated string allowed_classifications = 5;
  bool memory_enabled = 6;
  int32 custom_retention_days = 7;
}

message TenantProfileState {
  // ... existing fields 1-15 ...
  TenantPolicyOverrides tenant_policy_overrides = 16; // was: PolicyOverrides policy_overrides
  // ... existing fields 90, 91 ...
}
ProjectProfileState does not yet exist — it lands alongside the ProjectProvisioningWorkflow and MUST carry ProjectPolicyOverrides project_policy_overrides. Until that proto ships, project overrides are not persisted; Rego sees only the synthetic __platform__ project for every request.
TenantPolicyOverrides migration note. The existing stub PolicyOverrides { map<string, string> overrides = 1; } at TenantProfileState field 16 has no production readers (only generated MCP wrappers and defaults code). The rename to TenantPolicyOverrides is safe — the Firestore field stays at position 16, only the type and field name change.
Edit Authority — Travila admin vs tenant admin
Edit authority is role-asymmetric: Travila admin authority is a strict superset of tenant admin authority. Travila admin can edit every configurable field; tenant admin can edit a defined self-serve subset.
| Layer | Edit authority |
|---|---|
| Platform base (policies/platform/*.rego) | Travila only — git + CI |
| Plan-tier overlay (policies/overlays/{tier}/) | Travila only — git + CI |
| Tenant overrides (TenantPolicyOverrides) | Travila admin: all fields. Tenant admin: self-serve subset (TBD) |
| Project overrides (ProjectPolicyOverrides) | Travila admin: all fields. Tenant admin: self-serve subset (TBD) |
Invariants regardless of role. Edit authority is which fields a role may write — it does not relax the four-layer merge. A Travila admin writing tenant.model_allowlist = ["X"] still cannot unblock something the platform base denies. All writes (Travila or tenant) traverse the same validate-then-persist pipeline.
Target surface split (final shape in a follow-up):
- ConsoleGateway (/admin/v1/*) — exposes the self-serve subset of Update*PolicyOverrides / GetEffective* to authenticated tenant admins.
- InternalAdminGateway (/internal/v1/*) — exposes the full field set to Travila staff, plus the ability to write/overwrite tenant-set fields for support and break-glass.
Deferred — what this roadmap does not commit to. The per-field role split (which fields are self-serve vs Travila-only) is not enumerated. TenantPolicyOverrides and ProjectPolicyOverrides are flat proto messages today; the split is a product/legal decision tracked as a follow-up. The MVP ships with full-set RPCs on InternalAdminGateway only; the ConsoleGateway subset lands once the per-field cut is decided.
Engineering posture while the cut is pending:
- The proto messages MUST NOT bake role authority into the schema — the same proto wires into both gateways with different field-level validation.
- Gateway implementations MUST NOT assume "tenant admin can write the whole message." Field-level authorization is a separate concern from the proto schema.
- Audit logs MUST capture which principal (Travila admin vs tenant admin) made each policy mutation, so post-hoc review can verify the eventual subset rule.
Gateway RPCs
rpc UpdateTenantPolicyOverrides(UpdateTenantPolicyOverridesRequest)
    returns (UpdateTenantPolicyOverridesResponse);
rpc GetEffectiveTenantPolicy(GetEffectiveTenantPolicyRequest)
    returns (GetEffectiveTenantPolicyResponse);
rpc UpdateProjectPolicyOverrides(UpdateProjectPolicyOverridesRequest)
    returns (UpdateProjectPolicyOverridesResponse);
rpc GetEffectiveProjectPolicy(GetEffectiveProjectPolicyRequest)
    returns (GetEffectiveProjectPolicyResponse);
GetEffective* returns the merged result of platform → tier → tenant → project layers so admins can see what is actually in force without reasoning about layering by hand.
Decision Logging & Audit
OPA sidecar → stdout (JSON decision logs)
→ Promtail / OTLP Collector
→ Loki (queryable, short retention)
→ Kafka audit topic (long-term retention)
Every evaluation produces a log entry:
{
  "decision_id": "abc-123",
  "timestamp": "2026-03-31T10:00:00Z",
  "path": "policy/model_access",
  "input": {
    "tenant_id": "bigbank",
    "project_id": "trading-prod",
    "user": {"id": "alice", "role": "developer"},
    "action": "llm.generate",
    "resource": {"model": "openai/gpt-4o"}
  },
  "result": {
    "allow": false,
    "reasons": ["Model 'openai/gpt-4o' not approved for EU tenants"]
  },
  "metrics": {
    "timer_rego_query_eval_ns": 142000
  }
}
100% of deny decisions are logged (success metric). Logs are queryable by tenant, user, action, time range — "show me all denied model access requests for tenant X in the last 24 hours" is a single Loki query. The full pipeline depends on the audit logging infrastructure from Audit Logging PRD (PR #537).
Production Hardening
Per-domain fail-open / fail-closed
OPA sidecar unavailability does not block requests beyond the per-domain configured behaviour:
| Domain | Default on OPA failure | Rationale |
|---|---|---|
| model_access | Fail closed | Safety / compliance critical — denying a request is better than serving from an unrestricted model list |
| feature_access | Fail open | Availability-sensitive — paid users shouldn't lose access during an OPA outage |
| data_residency | Fail closed | Compliance critical |
| tool_execution | Fail closed | Safety critical — tools may have side effects |
| memory (search) | Fail open | Worst case of failing open is an empty search result |
| memory (extraction) | Fail closed | Consent gate — never extract without affirmative consent |
| rbac | Fail closed | Security critical |
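Because the posture is per-domain configuration rather than per-call judgment, the handling can live in one wrapper around pkg/policy. An illustrative Go sketch (the map mirrors the table above; splitting memory search and extraction into separate policy paths is an assumption, not something the PRD specifies):

// failOpen records which domains may proceed when OPA is unreachable.
var failOpen = map[string]bool{
    "policy/model_access":      false,
    "policy/feature_access":    true,
    "policy/data_residency":    false,
    "policy/tool_execution":    false,
    "policy/memory/search":     true,  // assumed sub-path for the search decision
    "policy/memory/extraction": false, // assumed sub-path for the extraction decision
    "policy/rbac":              false,
}

// evaluateWithPosture wraps Evaluate with the per-domain failure behaviour:
// fail-open domains get an allow-all decision on error, fail-closed domains
// get a deny carrying the outage as the reason.
func evaluateWithPosture(ctx context.Context, c *policy.Client, path string, input map[string]any) (*policy.Decision, error) {
    decision, err := c.Evaluate(ctx, path, input)
    if err == nil {
        return decision, nil
    }
    slog.Warn("policy evaluation failed", "domain", path, "error", err)
    if failOpen[path] {
        return &policy.Decision{Allow: true, Reasons: []string{"fail-open: OPA unreachable"}}, nil
    }
    return &policy.Decision{Allow: false, Reasons: []string{"fail-closed: OPA unreachable"}}, nil
}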
Bundle staleness
Policy bundles are cached locally on each pod. OPA continues evaluating with the last-synced bundle if the bundle server is unavailable. Stale-but-safe is structural — restrictions only ratchet toward more-restrictive across layers, so a stale bundle never grants permissions a fresh bundle would have denied. A staleness metric alerts on lag exceeding 60s (policies) or 30s (data).
Read-only bundles, validated overrides
Policy bundles are read-only at runtime — there is no API to mutate a loaded bundle. Tenants configure overrides via structured proto fields (TenantPolicyOverrides, ProjectPolicyOverrides), which the bundle builder validates and lifts into the data bundle. Tenant-submitted overrides cannot weaken Travila-level security policies — the four-layer merge invariant guarantees this regardless of what a tenant submits.
Performance targets
| Operation | Target |
|---|---|
| OPA policy evaluation (local sidecar) | < 1ms p95 |
| Policy bundle sync | < 30s after push |
| Decision log shipping | < 5s to Loki |
| End-to-end Firestore write → OPA has new config | ~10–30s |
Sidecar memory overhead is ~50MB per Restate pod. Explicit open question 1 weighs sidecar vs shared OPA service — current leaning is sidecar for the latency, with memory accepted as cheap.
Rollout Phases
The 6-phase plan from the PRD. Each phase is independent — domains migrate one at a time, no big-bang cutover.
| Phase | Scope | Status |
|---|---|---|
| 1. OPA Infrastructure + pkg/policy | Add OPA sidecar to Helm charts. Create pkg/policy/ client (Evaluate()). Configure GCS bundle server + dual-source sync (policies + data). Deploy bundle builder Restate service consuming platform.config.changes. Wire decision logging to Loki + Kafka audit topic. Deploy to staging with allow-all base policies. Unblocks Billing Phase 5 (real-time entitlement enforcement) — see Billing Roadmap. | Not started |
| 2. Model Access + Data Residency | Write policy/model_access Rego with EU-approved model list and project allowlist. Write policy/data_residency Rego (webhook region, LLM provider DPA). Integrate into generation workflow + webhook service in shadow mode. Add TenantPolicyOverrides.model_allowlist/denylist. Tests for every rule. Cut over from shadow → active per the three-step migration pattern. | Not started |
| 3. Feature Access + Plan Tier | Write policy/feature_access Rego with plan tier hierarchy + feature→tier map. Create plan tier overlay bundles (policies/overlays/{free,starter,growth,enterprise}/). Integrate into gateways and generation workflow. Per-toggle audit of ad-hoc feature gates to classify entitlement vs release flag; migrate only the entitlements (release flags route to Flipt instead). | Not started |
| 4. RBAC Migration | Write policy/rbac Rego with role → action mapping. Migrate role_resolver.go and scope_checker.go logic to OPA. Integrate into KrakenD plugin / gateway deriveRequestContext(). Shadow-mode validation against existing role→scope behaviour. Remove hardcoded role maps from auth service. | Not started |
| 5. Tool Execution + Memory | Write policy/tool_execution Rego (HIPAA approval, trusted MCP servers). Write policy/memory Rego with consent integration from GDPR Readiness PRD. Migrate partitionToolCalls() pattern matching and mem0Config.GetEnabled() to OPA. Add HIPAA mode support. | Not started |
| 6. Tenant Self-Service Policy Config | Add UpdateTenantPolicyOverrides / UpdateProjectPolicyOverrides / GetEffective* RPCs to InternalAdminGateway. Implement the four-layer merge with GetEffective* so admins can see actual policy in force. Bundle build pipeline includes tenant-specific overlays. Admin dashboard surface for policy configuration. | Not started |
Dependency ordering
| Phase depends on | Reason |
|---|---|
| Phase 2 depends on Phase 1 | OPA sidecar + pkg/policy + bundle pipeline must exist before any domain has somewhere to evaluate |
| Phases 3, 4, 5 depend on Phase 1 | Same — each is an additional domain on the substrate Phase 1 ships |
| Phase 6 depends on Phase 2 minimum | First domain-with-real-rules must be in production before tenant self-service can meaningfully expose overrides; the GetEffective* merge view assumes at least one populated layer beyond platform base |
| Billing Phase 5 depends on this PRD's Phase 1 | The opa-consumer and gateway middleware in the Billing Roadmap need OPA infrastructure (sidecar deployment, pkg/policy client, GCS bundle server, decision logging) — explicit cross-PRD blocker |
| Domains within Phases 2–5 migrate independently | Each enforcement point follows shadow → active → cleanup independently; one domain's regression does not block another |
Out of Scope for MVP
- Tenant-authored Rego policies — security risk. Tenants configure via structured TenantPolicyOverrides / ProjectPolicyOverrides, not raw Rego. Custom Rego stays in policies/tenants/{tid}/ under platform engineering control.
- Real-time policy evaluation for streaming tokens — per-token evaluation is too expensive. Policy evaluates at request start. AI Safety streaming controls are a separate concern.
- A/B testing policies — future enhancement. Not needed for initial compliance/safety use cases. Release-engineering A/B is handled by Flipt anyway.
- Cost-based policy (budget limits) — handled by Lago entitlements, not OPA. OPA reads the projected entitlement state via opa-consumer in billing Phase 5, but the source of truth is Lago.
- Network-level policy (IP allowlists at LB) — infrastructure concern; IP allowlists enforced at KrakenD/LB, not OPA. OPA can read input.context.ip for decision logging but does not own ingress firewalling.
- Release feature flags — canary rollouts, A/B variants, kill switches, time-bounded experiments. Owned by Flipt via OpenFeature per PR #1270. Misclassifying a release flag as an entitlement is the most common adoption error.
Open Questions
| # | Question | Owner | Status |
|---|---|---|---|
| 1 | OPA sidecar vs shared OPA service? | Infrastructure | Leaning sidecar — latency is critical, ~50MB per pod is cheap |
| 2 | Should OPA replace OpenFGA for RBAC or complement it? | Engineering | Leaning complement — OpenFGA for relationship-based (is user in org?), OPA for policy-based (can role do action?). Both coexist; the call site decides which to ask |
| 3 | How to handle policy versioning for rollback? | Engineering | Open — GCS object versioning on bundles vs git-tag-based bundle builds |
| 4 | Do AI Safety content-filtering rules (PR #1227) belong in OPA or a dedicated ML service? | Engineering | Open — OPA handles structured access policy cleanly, but PII detection / toxicity scoring may need a separate ML-based service that OPA calls rather than implementing in Rego |
| 5 | Should policy decisions be exposed via API response header (e.g., X-Policy-Decision)? | Engineering | Open — useful for debugging, but risks leaking policy internals to clients |
Cross-References
- Tenant Policy Engine PRD — the v0.2 design in full
- Billing, Metering & Invoicing Roadmap — billing Phase 5 (real-time enforcement) blocks on this PRD's Phase 1
- Tenant & Project Lifecycle Roadmap — multi-tenant substrate; tenant/project Firestore docs are the source of truth the data bundle projects from
- GCP Eventarc → Kafka Pipeline Roadmap — sibling pipeline; data-bundle sync reuses the same Pub/Sub → Kafka bridge
- Edge Idempotency Roadmap — typed-subject format used in input.user.id
- Tenant Onboarding PRD — InternalAdminGateway surface for override RPCs
- Data Residency PRD — the regional model data_residency enforces
- GDPR Readiness PRD — consent tracking the memory policy reads
- AI Safety PRD (PR #1227) — content moderation; open question whether it lives in OPA
- Feature Flags PRD (PR #1270) — Flipt / OpenFeature for release flags; the explicit complement to OPA's entitlement scope
- Audit Logging PRD (PR #537) — decision-log pipeline this design ships into
- Open Policy Agent docs — upstream reference
- Rego language reference — for policy authors