
Policy Engine (OPA)

End-to-end view of how policy decisions — model access, feature entitlements, data residency, tool approval, memory consent, RBAC — become a first-class platform primitive evaluated by a single engine. This page consolidates the Tenant Policy Engine PRD (v0.2, Draft) and its dependencies into a single reference. The design is complete; the 6-phase rollout has not started.

Source PRDs

This page is derived from the Tenant Policy Engine PRD and its closely related dependencies:

Primary (docs/prd/):

  • Tenant Policy Engine — the v0.2 design: OPA sidecar topology, dual-source bundles (CI-built Rego + Kafka-synced data), six policy domains, per-domain fail-open/closed posture, four-layer override merge, six-phase incremental migration

Dependencies (docs/prd/multi-tenancy/):

  • 02 · Projects — establishes (tenant_id, project_id) as the policy boundary. Rego rules key into data.projects[tid][pid] for every evaluation
  • 03 · Tenant Onboarding — InternalAdminGateway is the surface for UpdateTenantPolicyOverrides / UpdateProjectPolicyOverrides RPCs
  • 07 · Data Residency — the regional model that the data_residency policy enforces

Architectural Direction — One Engine, Two Bundles, Composable Decisions

The whole design refuses to invent per-domain policy services or scatter if tenant.tier == "x" checks across the codebase:

  • One engine for every policy domain. Model access, feature entitlements, data residency, tool execution, memory consent, RBAC — six domains, one OPA sidecar, one pkg/policy.Evaluate(path, input) call. A gateway request that needs RBAC + model access + feature access is one composable decision, not three Redis lookups.
  • Sidecar, not shared service. Each Restate pod gets its own OPA on localhost:8181. Sub-millisecond evaluation, no network hop, no shared-service blast radius. The trade-off is ~50MB memory per pod — cheap relative to the latency saved.
  • Dual-source bundles. Rego rules ship from git via CI to GCS (policies.tar.gz, polled every 30–60s). Tenant/project config ships from Firestore via Kafka to GCS (data.tar.gz, polled every 5–10s). Logic and data have different sources, different cadences, and different change owners — but compose at evaluation time via data.tenants[input.tenant_id].
  • Input is request-shaped, data is platform-shaped. Services pass per-request context (tenant_id, project_id, user.role, action, resource) in input. Long-lived tenant tier, region, plan, feature flags live in data.*. Per-request payload stays ~200 bytes; Rego has full platform + tenant + project config in memory.
  • Restrictions stack additively across four layers. Platform base → plan-tier overlay → tenant override → project override. Denylists union across layers; allowlists pick the most-specific non-empty layer; restriction-direction-only scalars (HIPAA, retention) ratchet upward. A higher layer cannot unblock what a lower layer denied.
  • Entitlements vs release flags is a hard line. Long-lived tier-driven decisions (Free can't access voice, EU can't use US-hosted models) belong in OPA. Time-bounded toggles a PM flips from a UI (new streaming UI at 10% canary, kill switch) belong in Flipt via OpenFeature. Misclassifying the second as the first is what bloated the current scattered enforcement.

Canonical reference: Tenant Policy Engine PRD. The PRD's "What Moves to OPA vs What Stays in Code" section catalogs every current enforcement file with its target domain and migration phase.

Policy engine architecture:

  • Rego policies pipeline: git policies/ directory → CI runs opa test + opa build → policies.tar.gz on a shared GCS bundle store.
  • In parallel, tenant/project Firestore documents flow through Firestore CDC → Pub/Sub → Kafka topic platform.config.changes → BundleBuilder Restate service (debounces 5s, emits the synthetic platform project for every tenant) → data.tar.gz on the same GCS store.
  • Each Restate pod runs a Go service alongside an OPA sidecar on localhost:8181 that polls policies every 30–60s and data every 5–10s, evaluating Rego in-memory under 1ms p95.
  • The Go service calls pkg/policy.Evaluate(ctx, path, input) over HTTP to /v1/data/{path} and receives a structured decision.
  • OPA emits stdout JSON decision logs that flow via Promtail/OTLP Collector to Loki (queryable, short retention) and a Kafka audit topic (long-term retention).
  • Stale-but-safe by construction: if Kafka or the bundle server is down, OPA evaluates against the last-synced bundle, and restrictions only ratchet toward more restrictive, never less.

Glossary

| Term | Definition |
|---|---|
| OPA | Open Policy Agent — CNCF-graduated policy engine. Evaluates Rego programs against JSON input + data, returns structured decisions. Self-hosted as a per-pod sidecar in this design. |
| Rego | OPA's declarative policy language. Used for all rule logic in policies/*.rego. Composable, testable via opa test, lint-checked in CI. |
| Policy bundle | A .tar.gz containing compiled Rego + optional data. OPA pulls bundles from a remote source on a polling interval and reloads on change. |
| Data bundle | A separate .tar.gz containing tenant/project/platform config as JSON. Loaded into data.tenants[*], data.projects[*][*], data.models.*, etc. — referenced by Rego rules at eval time. |
| pkg/policy | Greenfield Go client library (new module). Thin HTTP wrapper around the OPA REST API. Single method: Evaluate(ctx, path, input) → (*Decision, error). Same pattern as pkg/cache, pkg/secrets. |
| Decision | { allow: bool, reasons: []string, obligations?: ... } — Rego returns structured output so callers can attach human-readable deny reasons and domain-specific obligations (e.g., retention_days, log_level). |
| Policy domain | A Rego package corresponding to one enforcement concern: policy/model_access, policy/feature_access, policy/data_residency, policy/tool_execution, policy/memory, policy/rbac. |
| Shadow mode | Migration phase 1: call OPA alongside existing code, log both decisions, alert on disagreement, but use the existing code's verdict. Catches edge cases before swapping authority. |
| Bundle builder | services/policybundlebuilder/v1/ — new Restate service. Kafka consumer on platform.config.changes. Reads affected tenant/project docs from Firestore, writes updated JSON to GCS data bundle path, debounces bursts. |
| __platform__ project | Reserved synthetic project ID emitted by the bundle builder for every tenant. Permissive defaults. Callers without a real project (admin tools, cron, legacy endpoints) pass project_id="__platform__" so Rego doesn't need if project guards. InternalAdminGateway rejects user creation of this id. |
| Fail-open / fail-closed | Per-domain config for what happens when OPA is unreachable. Safety-critical domains (model_access, data_residency, tool_execution, rbac) fail-closed. Availability-sensitive domains (feature_access, memory search) fail-open. |
| Entitlement | Long-lived tenant-tier / compliance / plan / region-driven capability. Lives in OPA. Example: "EU tenants can only use models with EU DPAs." |
| Release flag | Time-bounded product/engineering toggle (canary, A/B, kill switch). Lives in Flipt via OpenFeature per PR #1270, not OPA. |
| Override layer | One of four merge layers: platform base (policies/platform/), plan-tier overlay (policies/overlays/{tier}/), tenant override (TenantPolicyOverrides on TenantProfileState), project override (ProjectPolicyOverrides on ProjectProfileState). |

Service & Component Inventory

New Services

| Component | Purpose | PRD section |
|---|---|---|
| pkg/policy/ (pkg/policy/client.go, go.mod) | Thin Go HTTP client wrapping OPA's /v1/data/{path} REST endpoint. One Evaluate() method that returns a structured Decision. Cross-cutting concern shipped as a shared module. | FR-1.3 |
| services/policybundlebuilder/v1/ (BundleBuilder) | Restate service + Kafka consumer on platform.config.changes. Reads Firestore tenant/project docs, writes updated JSON to GCS data-bundle path, debounces bursts (5s window), emits the synthetic __platform__ project for every tenant. | FR-5C |
| OPA sidecar container (in every Restate pod) | Open-source openpolicyagent/opa image. Configured with two bundle sources (policies + data), local decision log shipper, listens on localhost:8181 only. | FR-1.1, FR-1.2, FR-5B.1 |
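The PRD pins pkg/policy down to a single Evaluate() method over OPA's Data API; everything else about the module is open. A minimal sketch of what that wrapper could look like — type names, constructor, and error handling are assumptions, not the shipped module:

package policy

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
)

// Sketch of the pkg/policy client described above; names are assumptions.
// Decision mirrors the structured output every Rego package returns.
type Decision struct {
    Allow       bool           `json:"allow"`
    Reasons     []string       `json:"reasons,omitempty"`
    Obligations map[string]any `json:"obligations,omitempty"`
}

// Client is a thin wrapper around the OPA sidecar's Data API on localhost:8181.
type Client struct {
    baseURL string
    http    *http.Client
}

func NewClient(baseURL string) *Client {
    return &Client{baseURL: baseURL, http: http.DefaultClient}
}

// Evaluate POSTs {"input": ...} to /v1/data/<path> and decodes the "result" envelope.
func (c *Client) Evaluate(ctx context.Context, path string, input map[string]any) (*Decision, error) {
    body, err := json.Marshal(map[string]any{"input": input})
    if err != nil {
        return nil, err
    }
    req, err := http.NewRequestWithContext(ctx, http.MethodPost,
        fmt.Sprintf("%s/v1/data/%s", c.baseURL, path), bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := c.http.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("opa returned %s", resp.Status)
    }

    var envelope struct {
        Result Decision `json:"result"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&envelope); err != nil {
        return nil, err
    }
    return &envelope.Result, nil
}

The path argument ("policy/model_access", …) maps directly onto OPA's /v1/data/policy/model_access URL, which is why the call-site path and the Rego package name stay aligned.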

New Proto Definitions

| Definition | Location | Notes |
|---|---|---|
| TenantPolicyOverrides | apis/platform/v1/services/tenant/types.proto | Replaces the stub PolicyOverrides { map<string, string> overrides = 1; } at TenantProfileState field 16. The stub has no production readers (only generated MCP wrappers / defaults code), so the rename is safe. Fields: model_allowlist, model_denylist, blocked_mcp_servers, require_tool_approval_all, hipaa_mode, feature_overrides (map<string, bool>), memory_enabled, phi_retention_years. |
| ProjectPolicyOverrides | apis/platform/v1/services/project/types.proto (greenfield) | Attached to the future ProjectProfileState. Fields: allowed_models, disabled_features, require_tool_approval, require_classification, allowed_classifications, memory_enabled, custom_retention_days. Until ProjectProfileState lands alongside the ProjectProvisioningWorkflow, project overrides are not persisted — Rego sees only the synthetic __platform__ project. |
| InternalAdminGateway RPCs | apis/platform/v1/services/internal-admin-gateway/ | UpdateTenantPolicyOverrides / GetEffectiveTenantPolicy / UpdateProjectPolicyOverrides / GetEffectiveProjectPolicy. |

Integrated Components (No Changes Required)

| Component | Role |
|---|---|
| GCS | Hosts both bundle objects: gs://platform-opa-bundles/platform/policies.tar.gz and .../data.tar.gz. ETag-driven If-None-Match lets OPA download only when content changed. |
| Firestore | Source of truth for tenant/project config. Firestore change stream emits to the existing Pub/Sub topic the bundle builder consumes via Kafka. |
| Pub/Sub → Kafka bridge | Reuses the same pipeline as the GCP Eventarc → Kafka Pipeline. Tenant/project doc changes route to platform.config.changes. |
| Loki / OTLP Collector | OPA decision logs ship via stdout → Promtail/OTLP → Loki for short-term queryability, plus a Kafka audit topic for long-term retention. See FR-6. |
| OpenFGA | Stays. Owns relationship-based authorization ("is user U a member of org O?", "is user U the owner of resource R?"). OPA owns policy-based authorization ("can role R do action A?"). Both engines coexist — see Appendix B in the PRD. |
| Unkey | Stays. Per-API-key rate limiting is not a policy decision and remains in Unkey. |
| Lago | Stays. Quota / metering / current_usage enforcement is billing data, not OPA. Real-time entitlement state derived from Lago lands in OPA via the billing roadmap's opa-consumer (their Phase 5 depends on Phase 1 here). |

Observability Additions

| Signal | Purpose |
|---|---|
| OPA decision log per evaluation | decision_id, structured input, result, reasons, timer_rego_query_eval_ns — Loki-shipped, queryable by tenant / user / action / time |
| Shadow-mode disagreement alert | During migration phase 1, every divergence between OPA and existing code logs and alerts — surfaces edge cases before swapping authority |
| Bundle sync staleness metric | Age of last-applied policy / data bundle per pod — alerts if sync lag exceeds 60s (policies) or 30s (data) |
| Bundle builder Kafka consumer lag | Detects backpressure if Firestore change rate exceeds debounce throughput |
| decision_id propagation | Each deny response includes the decision id for support-ticket forensics; tied to the Loki entry |

Bundle Architecture — Logic and Data on Different Pipes

OPA bundles carry two distinct kinds of content with different sources and different update cadences. Conflating them is what makes OPA deployments brittle in practice; splitting them is the whole architectural move.

| Content type | Source | Update cadence | Example |
|---|---|---|---|
| Rego rules (logic) | Git repo (policies/) → CI build → GCS policies.tar.gz | On merge to main (typically minutes) | "PHI classification requires retention" |
| Data files (config) | Firestore → Pub/Sub → Kafka → bundle builder → GCS data.tar.gz | Real-time (seconds, debounced) | "mayo-clinic PHI retention = 10 years" |

Repo layout

policies/
├── platform/ # Base platform policies (all tenants)
│ ├── model_access.rego
│ ├── feature_access.rego
│ ├── data_residency.rego
│ ├── tool_execution.rego
│ ├── memory.rego
│ ├── rbac.rego
│ └── data.json # Shared static data (feature → tier map, plan hierarchy)
├── overlays/ # Per-plan-tier overrides
│ ├── free/feature_access.rego
│ ├── enterprise/model_access.rego
│ └── hipaa/tool_execution.rego
├── tenants/ # Per-tenant custom policies (rare)
│ └── bigbank/model_access.rego
└── tests/ # opa test fixtures, one per package
    ├── model_access_test.rego
    └── ...

CI build pipeline

opa test policies/ -v                                                      # gate every PR
opa build -b policies/platform -b policies/overlays/${PLAN_TIER} -o policies.tar.gz
gsutil cp policies.tar.gz gs://platform-opa-bundles/platform/policies.tar.gz

Kafka → OPA data sync

InternalAdminGateway / ProjectWorkflow
  → Firestore
  → Firestore CDC change stream
  → Pub/Sub → Kafka
  → BundleBuilder (Restate svc)
  → push to GCS data.tar.gz
  → OPA sidecars (poll every 5s)

End-to-end latency: Firestore write → OPA has new config in ~10–30 seconds.

Why not direct Firestore → OPA

  • OPA doesn't natively read from Firestore (only HTTP bundles or push).
  • Kafka adds durability — if the bundle builder is briefly down, events are not lost.
  • Same Pub/Sub → Kafka bridge already exists for GCS and other event sources, so this is a new consumer, not a new pipeline.
  • The bundle builder can debounce bursts (5s window) into one GCS write, avoiding write-storm during bulk tenant operations.
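A sketch of that 5-second debounce, assuming a channel of decoded Kafka records and a hypothetical writeDataBundle helper that rebuilds and uploads data.tar.gz (none of these names come from the PRD):

package policybundlebuilder

import (
    "context"
    "log/slog"
    "time"
)

// ConfigChange is a simplified stand-in for a decoded platform.config.changes record.
type ConfigChange struct {
    TenantID  string
    ProjectID string
}

// debounceAndRebuild collects change events for a 5s window after the first event,
// then triggers a single data-bundle rebuild, collapsing bursts into one GCS write.
func debounceAndRebuild(ctx context.Context, changes <-chan ConfigChange,
    writeDataBundle func(ctx context.Context, touchedTenants map[string]bool) error) {

    const window = 5 * time.Second
    pending := map[string]bool{} // tenants touched since the last rebuild
    timer := time.NewTimer(window)
    if !timer.Stop() {
        <-timer.C // start with the timer idle
    }

    for {
        select {
        case <-ctx.Done():
            return
        case ev := <-changes:
            if len(pending) == 0 {
                timer.Reset(window) // first event of a burst starts the window
            }
            pending[ev.TenantID] = true
        case <-timer.C:
            if err := writeDataBundle(ctx, pending); err != nil {
                slog.Error("data bundle rebuild failed", "error", err)
                timer.Reset(window) // keep the batch and retry after another window
                continue
            }
            pending = map[string]bool{}
        }
    }
}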

Fallback behaviour

If the Kafka pipeline is down, OPA continues with the last-synced data bundle. Stale-but-safe by construction: policies only become more restrictive with stale data, never less (denylists union, allowlists pick most-specific non-empty, scalars ratchet toward restriction). A reconciliation job periodically rebuilds the full data bundle from Firestore as a consistency check.

__platform__ synthetic project

For every tenant, the bundle builder emits a synthetic project __platform__ with permissive defaults. This guarantees data.projects[tid][pid] always resolves and removes the need for if project guards in every Rego rule. Callers without a real project (admin paths, cron jobs, legacy endpoints) pass project_id="__platform__".

{
  "allowed_models": [],
  "disabled_features": [],
  "require_tool_approval": false,
  "require_classification": false,
  "allowed_classifications": [],
  "memory_enabled": true,
  "custom_retention_days": 0
}

The __platform__ project name is reserved — InternalAdminGateway rejects creation with this id.
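The guard itself is small; a hypothetical version of the create-path check (illustrative, not the gateway's actual code):

package internaladmingateway

import "fmt"

// reservedPlatformProjectID is the synthetic project every tenant receives in the data bundle.
const reservedPlatformProjectID = "__platform__"

// validateNewProjectID rejects user attempts to create the reserved synthetic project.
func validateNewProjectID(projectID string) error {
    if projectID == reservedPlatformProjectID {
        return fmt.Errorf("project_id %q is reserved for the platform-synthesized project", projectID)
    }
    return nil
}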

Policy Domains

Six domains, one Rego package each, one enforcement-point family each. Every domain's failure mode and migration phase is locked in the PRD.

| Domain | Rego package | What it decides | Replaces | Fail-on-OPA-down | Phase |
|---|---|---|---|---|---|
| Model access | policy/model_access | "Can this tenant/user use this model?" Region restrictions, plan-tier model lists, project allowlists | services/openrouter/v1/metadata_filter.go (access portion; metadata filters for capabilities stay) | Closed (deny) | 2 |
| Feature access | policy/feature_access | "Does this plan tier / project allow this feature?" rag, voice, multi_agent, memory, webhooks, custom_models, … gated by plan hierarchy | Tier-driven hardcoded checks (NOT release flags — see scope note) | Open (allow) | 3 |
| Data residency | policy/data_residency | "Does this action keep data in the tenant's region?" Webhook endpoint URL region check; LLM provider EU-DPA requirement | New capability — enables Data Residency PRD | Closed (deny) | 2 |
| Tool execution | policy/tool_execution | "Can this tool run, and does it need human approval?" HIPAA mode forces approval on all; specific tools require approval for non-admins; untrusted MCP servers blocked | partitionToolCalls() pattern matching in workflows/generation/v1/impl.go | Closed (deny) | 5 |
| Memory & consent | policy/memory | "Can we extract memories from this conversation? Can we search the user's memories?" Requires user consent for extraction; tenant/project toggle for search | mem0Config.GetEnabled() check in workflows/generation/v1/memory_helpers.go | Open for search, closed for extraction | 5 |
| RBAC | policy/rbac | "Can this role perform this action?" Role → action mapping (admin/owner full, developer scoped, viewer read-only) | Hardcoded role→scope map in services/auth/v1/role_resolver.go + scope_checker.go | Closed (deny) | 4 |

Per-domain Rego semantics

Two patterns matter across all six packages:

Deny wins explicitly. A bare allow if { count(deny) == 0 } fails when the body fails — and OPA falls back to default allow := true, which silently allows the request despite a deny. The PRD locks the correct shape into every default-true package:

default allow := true

deny contains msg if { ... }

allow := false if {
    count(deny) > 0
}

Missing keys behave correctly under == comparisons. A search_allowed := false if tenant.memory_enabled == false rule does NOT match when tenant.memory_enabled is absent — the comparison is undefined, the rule body fails, and the default search_allowed := true stays. This is why scalar overrides in the PRD use explicit == false rather than != true.
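This behaviour is easy to pin down with an in-process test. A sketch using OPA's Go rego package — the import path shown is the pre-1.0 module path (adjust for the OPA version in use), and the toy module below only mirrors the memory_enabled pattern; it is not the real policies/platform/memory.rego:

package policy_test

import (
    "context"
    "testing"

    "github.com/open-policy-agent/opa/rego"
)

// Toy module mirroring the "scalar override uses == false" pattern from the PRD,
// reading from input instead of data for brevity.
const module = `
package policy.memory

import rego.v1

default search_allowed := true

search_allowed := false if {
    input.tenant.memory_enabled == false
}
`

func evalSearchAllowed(t *testing.T, input map[string]any) bool {
    t.Helper()
    rs, err := rego.New(
        rego.Query("data.policy.memory.search_allowed"),
        rego.Module("memory.rego", module),
        rego.Input(input),
    ).Eval(context.Background())
    if err != nil {
        t.Fatalf("eval: %v", err)
    }
    return rs[0].Expressions[0].Value.(bool)
}

func TestMissingKeyKeepsDefault(t *testing.T) {
    // Key absent: the == false comparison is undefined, the rule body fails,
    // and the default (true) stays in force.
    if !evalSearchAllowed(t, map[string]any{"tenant": map[string]any{}}) {
        t.Fatal("expected default allow when memory_enabled is absent")
    }
    // Explicit false: the rule matches and flips the result.
    if evalSearchAllowed(t, map[string]any{"tenant": map[string]any{"memory_enabled": false}}) {
        t.Fatal("expected deny when memory_enabled is explicitly false")
    }
}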

Decision-tuple shape

Every domain returns allow: bool + optional reasons: [string]. Some domains add domain-specific fields the caller applies directly:

{
  "result": {
    "allow": false,
    "reasons": [
      "Model 'anthropic/claude-sonnet-4' is not in the EU-approved model list for tenant 'bigbank'",
      "Tenant region 'eu' requires models with EU data processing agreements"
    ],
    "obligations": {
      "log_level": "warn",
      "notify_admin": false
    }
  }
}

Storage upload returns retention_days, classification, access_default; tool execution returns require_approval: bool. These are documented in each domain's consuming PRD (e.g., Storage Model PRD §8.5 for storage obligations).

Input vs Data — What Goes Where

Services pass request-specific context in input. Tenant/project config lives in the data bundle.

| Field | Source | Rationale |
|---|---|---|
| tenant_id, project_id | Input (per-request) | Identifies which data to look up |
| user.id, user.role, user.scopes | Input (per-request) | Changes per request |
| action, resource.* | Input (per-request) | What the user is trying to do |
| context.ip, context.timestamp | Input (per-request) | Audit / forensics |
| Tenant plan_tier, data_region, hipaa_mode, phi_retention_years | Data bundle | Changes rarely (admin action), shared across requests |
| Project allowed_classifications, custom_retention_days, require_tool_approval | Data bundle | Changes rarely (admin action), shared across requests |
| data.models.eu_approved, data.models.eu_dpa_providers, feature→tier map, plan hierarchy | Data bundle (platform-static) | Platform config, changes on deploy or admin action |
Per-request input stays ~200 bytes while OPA has full platform + tenant + project config in memory. project_id is always required — callers without a real project pass "__platform__".

Four-Layer Override Merge

Four-layer override merge — policies stack from least to most specific:

  1. Platform base in policies/platform/, via git CI build — supplies data residency rules and default RBAC.
  2. Plan-tier overlay in policies/overlays/{tier}/, via git CI build per tier — e.g. enterprise gets expanded models, hipaa gets mandatory tool approval.
  3. Tenant override in data.tenants[tid], sourced from Firestore via Kafka into data.tar.gz — e.g. hipaa_mode true, model_denylist [openai/*], phi_retention_years 10.
  4. Project override in data.projects[tid][pid], also from Firestore via Kafka — e.g. disabled_features [voice], allowed_models [claude-sonnet-4].

Three merge rules apply by field shape: denylists are additive across all four layers (net deny = union of all); allowlists pick the most-specific non-empty layer (project → tenant → tier → platform, first non-empty wins); booleans and scalars use most-specific-wins but restriction-direction only (higher layers ratchet toward more restrictive, never relax). The single invariant: a higher layer can never unblock what a lower layer denied — project.allowed_models=["A"] does NOT grant access to "B" if tenant.model_denylist contains "B".

Merge semantics by field shape

| Field shape | Examples | Merge rule |
|---|---|---|
| Denylists | model_denylist, blocked_mcp_servers, disabled_features | Additive across layers — every layer can add. Net deny = union(platform, tier, tenant, project). |
| Allowlists | model_allowlist, allowed_models, allowed_classifications | Most-specific non-empty wins — a non-empty layer replaces lower layers. If project.allowed_models is non-empty it is used; else tenant.model_allowlist; else platform/tier defaults. |
| Booleans / scalars | hipaa_mode, require_tool_approval, memory_enabled, phi_retention_years | Most-specific-wins, restriction-direction only. A higher layer can flip toward more restrictive (memory_enabled: false, require_tool_approval: true, higher retention) but cannot relax a lower layer's restriction. |

The single invariant

A higher layer can never unblock what a lower layer denied. A project allowlist of ["A"] does NOT grant access to model "B" if tenant.model_denylist contains "B". Allowlists narrow the available set; they do not override denylists. This invariant is what makes tenant self-service safe — tenants can ratchet down, never up.
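Spelled out in Go over hypothetical layer structs — an illustration of the merge semantics, not the bundle builder's or GetEffective*'s actual implementation:

package policy

// OverrideLayer holds just the fields needed to illustrate the merge; the real
// layers carry the full TenantPolicyOverrides / ProjectPolicyOverrides field set.
type OverrideLayer struct {
    ModelAllowlist      []string
    ModelDenylist       []string
    RequireToolApproval bool
    PHIRetentionYears   int
}

// MergeLayers applies the three merge rules across platform → tier → tenant → project
// (pass layers least-specific first).
func MergeLayers(layers ...OverrideLayer) OverrideLayer {
    var eff OverrideLayer
    for _, l := range layers {
        // Denylists are additive: the effective denylist is the union of every layer.
        eff.ModelDenylist = append(eff.ModelDenylist, l.ModelDenylist...)

        // Allowlists: the most-specific non-empty layer wins outright.
        if len(l.ModelAllowlist) > 0 {
            eff.ModelAllowlist = l.ModelAllowlist
        }

        // Booleans ratchet toward restriction: a layer can switch approval on,
        // but a more specific layer can never switch it back off.
        if l.RequireToolApproval {
            eff.RequireToolApproval = true
        }

        // Scalars ratchet toward restriction: retention can only grow.
        if l.PHIRetentionYears > eff.PHIRetentionYears {
            eff.PHIRetentionYears = l.PHIRetentionYears
        }
    }

    // The invariant: the merged allowlist never re-grants a denied model, so a
    // higher layer cannot unblock what a lower layer denied.
    denied := make(map[string]bool, len(eff.ModelDenylist))
    for _, m := range eff.ModelDenylist {
        denied[m] = true
    }
    allowed := make([]string, 0, len(eff.ModelAllowlist))
    for _, m := range eff.ModelAllowlist {
        if !denied[m] {
            allowed = append(allowed, m)
        }
    }
    eff.ModelAllowlist = allowed
    return eff
}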

Enforcement Integration

Call shape

Every enforcement point calls pkg/policy.Evaluate(). The standard call site:

projectID := reqCtx.GetProject()
if projectID == "" {
    projectID = "__platform__" // FR-5C.4
}

decision, err := s.policyClient.Evaluate(ctx, "policy/model_access", map[string]any{
    "tenant_id":  reqCtx.GetTenant(),
    "project_id": projectID,
    "user":       map[string]any{"id": reqCtx.GetSubject(), "role": userRole},
    "action":     "llm.generate",
    "resource": map[string]any{
        "model":    effectiveConfig.GetModel(),
        "provider": resolveProvider(effectiveConfig.GetModel()),
    },
})
if err != nil {
    // Per-domain fail-open / fail-closed config below; model_access fails closed,
    // so an evaluation error denies the request rather than skipping the check.
    slog.Warn("policy evaluation failed", "error", err, "domain", "model_access")
    return nil, fmt.Errorf("policy evaluation unavailable: %w", err)
}
if !decision.Allow {
    return nil, sdkgo.TerminalError(
        fmt.Errorf("policy denied: %s", strings.Join(decision.Reasons, "; ")),
        403,
    )
}

Incremental migration — three phases per enforcement point

Each enforcement point migrates independently. No big-bang cutover.

Phase 1 — Shadow mode (1 week per domain):
- Add OPA call alongside existing code
- Log both decisions, alert on disagreement
- Existing code remains authoritative

Phase 2 — OPA active:
- OPA decision becomes authoritative
- Existing code remains as dead-code fallback (feature-flagged)
- Monitor for regressions

Phase 3 — Cleanup:
- Remove old if/else/switch logic
- Remove feature flag
- OPA is sole decision maker

Shadow-mode disagreements are how edge cases land in the test corpus before they land in production denials.
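In code, shadow mode is a thin wrapper around the legacy check. A hypothetical helper (names and import path assumed) that keeps the legacy verdict authoritative while logging disagreements:

package generation

import (
    "context"
    "log/slog"

    "example.com/platform/pkg/policy" // hypothetical import path for the pkg/policy client
)

// evaluateShadow runs the OPA decision alongside the legacy verdict, logs any
// disagreement, and always returns the legacy verdict — phase 1 of the migration.
func evaluateShadow(ctx context.Context, pc *policy.Client, domain string,
    input map[string]any, legacyAllow bool) bool {

    decision, err := pc.Evaluate(ctx, domain, input)
    if err != nil {
        // In shadow mode an OPA failure is only an observability event.
        slog.Warn("shadow policy evaluation failed", "domain", domain, "error", err)
        return legacyAllow
    }
    if decision.Allow != legacyAllow {
        slog.Warn("shadow policy disagreement",
            "domain", domain,
            "opa_allow", decision.Allow,
            "legacy_allow", legacyAllow,
            "reasons", decision.Reasons)
    }
    return legacyAllow
}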

What Moves to OPA vs What Stays in Code

The line: policy decisions vs business logic

| Move to OPA | Keep in code |
|---|---|
| "Can tenant X use model Y?" | "How do I call OpenRouter with model Y?" |
| "Does project P require file classification?" | "How do I write a Firestore doc?" |
| "Is this tool approved for this user?" | "How do I execute this tool?" |
| "What retention period applies?" | "How do I set GCS Object Retention?" |
| "Is this feature available on this plan?" | "How do I render the feature gate error?" |
| "Which models are allowed in EU?" | "How do I parse the OpenRouter response?" |

Rule of thumb. If the answer could differ between two tenants or two projects with the same code deployed, it's a policy decision → OPA. If it's the same regardless of who's calling, it's business logic → stays in code.

Migration map (file → domain → phase)

| Current enforcement | File | OPA domain | Migration phase |
|---|---|---|---|
| Model metadata filtering (access portion) | services/openrouter/v1/metadata_filter.go | model_access | Phase 2 |
| exclude_moderated filter | services/openrouter/v1/metadata_filter.go | model_access | Phase 2 |
| HTTPS-only webhook enforcement | services/webhook/v1/tenant_helpers.go | data_residency | Phase 2 |
| mem0Config.GetEnabled() (if it's entitlement, not release flag) | workflows/generation/v1/memory_helpers.go | feature_access | Phase 3 |
| Role → scope mapping | services/auth/v1/role_resolver.go | rbac | Phase 4 |
| Scope checking | services/auth/v1/scope_checker.go | rbac | Phase 4 |
| Tool approval pattern matching | workflows/generation/v1/impl.go (partitionToolCalls) | tool_execution | Phase 5 |
| Memory extraction gate | workflows/generation/v1/memory_helpers.go | memory | Phase 5 |

Stays in code (and why)

| Concern | Why it stays |
|---|---|
| Request validation (buf.validate annotations) | Schema enforcement, same for all tenants |
| Restate retry/timeout config | Infrastructure concern, not tenant-specific |
| OpenFGA relationship checks ("is user member of project?") | Relationship-based auth, not policy |
| Error code mapping (gRPC/HTTP status) | Deterministic transformation |
| Kafka topic routing | Infrastructure concern |
| Proto serialization / deserialization | Mechanical transformation |
| GCS signed URL generation | Implementation detail after policy allows the download |
| Quota enforcement (Lago current_usage check) | Metering/billing — Lago authoritative, not OPA |
| Rate limiting | Unkey per-key limits + KrakenD — different mechanism, different cadence |

Entitlements vs release flags — the hard line

Two distinct categories of "this tenant can/can't use feature X" decisions belong in different systems:

| Concern | System | Examples |
|---|---|---|
| Feature entitlements — long-lived rules driven by tenant tier, compliance regime, region, plan, or platform security policy | OPA (this PRD) | "Free tier can't access voice", "EU tenants can't use US-hosted models", "HIPAA-mode tenants must use HIPAA-approved models", "Enterprise tier gets longer retention" |
| Release feature flags — time-bounded toggles driven by product rollout, canary, A/B test, or kill switch | Flipt via OpenFeature (PR #1270) | "New streaming UI at 10% canary", "A/B test variant B for MCP panel", "kill switch on the buggy memory extractor", "beta tester allowlist for experimental model X" |

Rule of thumb. If a PM wants to flip it from a UI without writing code or opening a policy PR, it's a release flag → Flipt. If a platform/security engineer would write it as a Rego rule that stays stable for months or years, it's an entitlement → OPA.

A per-toggle audit of ad-hoc gates like mem0Config.GetEnabled() is required before migration — if it's actually a product rollout rather than a tier-gated entitlement, it belongs in Flipt, not in the feature_access Rego policy.
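The two categories also look different at the call site. A sketch assuming the pkg/policy client above and the OpenFeature Go SDK that PR #1270 introduces — the input fields, flag name, and wiring are assumptions:

package generation

import (
    "context"

    "github.com/open-feature/go-sdk/openfeature"

    "example.com/platform/pkg/policy" // hypothetical import path
)

// Entitlement: long-lived, tier/region-driven — asked of OPA.
func voiceEntitled(ctx context.Context, pc *policy.Client, tenantID, projectID string) bool {
    d, err := pc.Evaluate(ctx, "policy/feature_access", map[string]any{
        "tenant_id":  tenantID,
        "project_id": projectID,
        "action":     "feature.use",
        "resource":   map[string]any{"feature": "voice"},
    })
    if err != nil {
        return true // feature_access fails open per the hardening table
    }
    return d.Allow
}

// Release flag: time-bounded rollout toggle — asked of Flipt via OpenFeature.
func newStreamingUIEnabled(ctx context.Context, tenantID string) bool {
    client := openfeature.NewClient("generation")
    enabled, _ := client.BooleanValue(ctx, "new-streaming-ui", false,
        openfeature.NewEvaluationContext(tenantID, map[string]any{}))
    return enabled
}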

Tenant Policy Configuration

Tenants and projects configure their overrides via InternalAdminGateway RPCs. Tenants never author raw Rego — they fill structured proto fields, which the bundle builder lifts into JSON the Rego rules read via data.tenants[tid] / data.projects[tid][pid].

// apis/platform/v1/services/tenant/types.proto

message TenantPolicyOverrides {
  repeated string model_allowlist = 1;
  repeated string model_denylist = 2;
  repeated string blocked_mcp_servers = 3;
  bool require_tool_approval_all = 4;
  bool hipaa_mode = 5;
  map<string, bool> feature_overrides = 6;
  bool memory_enabled = 7;
  int32 phi_retention_years = 8;
}

message ProjectPolicyOverrides {
  repeated string allowed_models = 1;
  repeated string disabled_features = 2;
  bool require_tool_approval = 3;
  bool require_classification = 4;
  repeated string allowed_classifications = 5;
  bool memory_enabled = 6;
  int32 custom_retention_days = 7;
}

message TenantProfileState {
  // ... existing fields 1-15 ...
  TenantPolicyOverrides tenant_policy_overrides = 16; // was: PolicyOverrides policy_overrides
  // ... existing fields 90, 91 ...
}

ProjectProfileState does not yet exist — it lands alongside the ProjectProvisioningWorkflow and MUST carry ProjectPolicyOverrides project_policy_overrides. Until that proto ships, project overrides are not persisted; Rego sees only the synthetic __platform__ project for every request.

TenantPolicyOverrides migration note. The existing stub PolicyOverrides { map<string, string> overrides = 1; } at TenantProfileState field 16 has no production readers (only generated MCP wrappers and defaults code). The rename to TenantPolicyOverrides is safe — the Firestore field stays at position 16, only the type and field name change.

Edit Authority — Travila admin vs tenant admin

Edit authority is role-asymmetric: Travila admin authority is a strict superset of tenant admin authority. Travila admin can edit every configurable field; tenant admin can edit a defined self-serve subset.

| Layer | Edit authority |
|---|---|
| Platform base (policies/platform/*.rego) | Travila only — git + CI |
| Plan-tier overlay (policies/overlays/{tier}/) | Travila only — git + CI |
| Tenant overrides (TenantPolicyOverrides) | Travila admin: all fields. Tenant admin: self-serve subset (TBD) |
| Project overrides (ProjectPolicyOverrides) | Travila admin: all fields. Tenant admin: self-serve subset (TBD) |

Invariants regardless of role. Edit authority is which fields a role may write — it does not relax the four-layer merge. A Travila admin writing tenant.model_allowlist = ["X"] still cannot unblock something the platform base denies. All writes (Travila or tenant) traverse the same validate-then-persist pipeline.

Target surface split (final shape in a follow-up):

  • ConsoleGateway (/admin/v1/*) — exposes the self-serve subset of Update*PolicyOverrides / GetEffective* to authenticated tenant admins.
  • InternalAdminGateway (/internal/v1/*) — exposes the full field set to Travila staff, plus the ability to write/overwrite tenant-set fields for support and break-glass.

Deferred — what this roadmap does not commit to. The per-field role split (which fields are self-serve vs Travila-only) is not enumerated. TenantPolicyOverrides and ProjectPolicyOverrides are flat proto messages today; the split is a product/legal decision tracked as a follow-up. The MVP ships with full-set RPCs on InternalAdminGateway only; the ConsoleGateway subset lands once the per-field cut is decided.

Engineering posture while the cut is pending:

  • The proto messages MUST NOT bake role authority into the schema — the same proto wires into both gateways with different field-level validation.
  • Gateway implementations MUST NOT assume "tenant admin can write the whole message." Field-level authorization is a separate concern from the proto schema.
  • Audit logs MUST capture which principal (Travila admin vs tenant admin) made each policy mutation, so post-hoc review can verify the eventual subset rule.
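One way to honour that posture is a gateway-side writable-field map keyed by principal, checked against the request's update mask — illustrative only, since the actual per-field cut is the deferred decision:

package gateway

import "fmt"

type Principal int

const (
    TravilaAdmin Principal = iota
    TenantAdmin
)

// writableFields maps each principal to the TenantPolicyOverrides fields it may set.
// The tenant-admin set is a placeholder — the real subset is the deferred product/legal call.
var writableFields = map[Principal]map[string]bool{
    TravilaAdmin: nil, // nil = every field
    TenantAdmin: {
        "model_denylist": true,
        "memory_enabled": true,
    },
}

// authorizeOverrideUpdate checks the fields named in the request's update mask
// against the caller's writable set, independent of the proto schema.
func authorizeOverrideUpdate(p Principal, updatedFields []string) error {
    allowed := writableFields[p]
    if allowed == nil {
        return nil
    }
    for _, f := range updatedFields {
        if !allowed[f] {
            return fmt.Errorf("field %q is not self-serve editable", f)
        }
    }
    return nil
}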

Gateway RPCs

rpc UpdateTenantPolicyOverrides(UpdateTenantPolicyOverridesRequest)
    returns (UpdateTenantPolicyOverridesResponse);
rpc GetEffectiveTenantPolicy(GetEffectiveTenantPolicyRequest)
    returns (GetEffectiveTenantPolicyResponse);

rpc UpdateProjectPolicyOverrides(UpdateProjectPolicyOverridesRequest)
    returns (UpdateProjectPolicyOverridesResponse);
rpc GetEffectiveProjectPolicy(GetEffectiveProjectPolicyRequest)
    returns (GetEffectiveProjectPolicyResponse);

GetEffective* returns the merged result of platform → tier → tenant → project layers so admins can see what is actually in force without reasoning about layering by hand.

Decision Logging & Audit

OPA sidecar → stdout (JSON decision logs)
→ Promtail / OTLP Collector
→ Loki (queryable, short retention)
→ Kafka audit topic (long-term retention)

Every evaluation produces a log entry:

{
  "decision_id": "abc-123",
  "timestamp": "2026-03-31T10:00:00Z",
  "path": "policy/model_access",
  "input": {
    "tenant_id": "bigbank",
    "project_id": "trading-prod",
    "user": {"id": "alice", "role": "developer"},
    "action": "llm.generate",
    "resource": {"model": "openai/gpt-4o"}
  },
  "result": {
    "allow": false,
    "reasons": ["Model 'openai/gpt-4o' not approved for EU tenants"]
  },
  "metrics": {
    "timer_rego_query_eval_ns": 142000
  }
}

100% of deny decisions are logged (success metric). Logs are queryable by tenant, user, action, time range — "show me all denied model access requests for tenant X in the last 24 hours" is a single Loki query. The full pipeline depends on the audit logging infrastructure from Audit Logging PRD (PR #537).

Production Hardening

Per-domain fail-open / fail-closed

OPA sidecar unavailability does not block requests beyond the per-domain configured behaviour:

| Domain | Default on OPA failure | Rationale |
|---|---|---|
| model_access | Fail closed | Safety / compliance critical — denying a request is better than serving from an unrestricted model list |
| feature_access | Fail open | Availability-sensitive — paid users shouldn't lose access during an OPA outage |
| data_residency | Fail closed | Compliance critical |
| tool_execution | Fail closed | Safety critical — tools may have side effects |
| memory (search) | Fail open | Low risk — worst case, failing open while the memory store is also down just returns an empty search |
| memory (extraction) | Fail closed | Consent gate — never extract without affirmative consent |
| rbac | Fail closed | Security critical |
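Rather than re-deciding the posture at every call site, callers could centralise it in the client. A sketch — the helper and the path-keyed map are assumptions, not part of pkg/policy as specified:

package policy

import "context"

// failOpen records the per-domain posture when the sidecar is unreachable.
// The split memory domain is modelled here as two query paths under one package
// (helper name and map are assumptions).
var failOpen = map[string]bool{
    "policy/model_access":              false,
    "policy/feature_access":            true,
    "policy/data_residency":            false,
    "policy/tool_execution":            false,
    "policy/rbac":                      false,
    "policy/memory/search_allowed":     true,
    "policy/memory/extraction_allowed": false,
}

// EvaluateWithPosture applies the domain's configured posture when evaluation
// fails, so call sites don't each re-implement fail-open vs fail-closed.
func (c *Client) EvaluateWithPosture(ctx context.Context, path string, input map[string]any) *Decision {
    d, err := c.Evaluate(ctx, path, input)
    if err == nil {
        return d
    }
    if failOpen[path] {
        return &Decision{Allow: true, Reasons: []string{"policy engine unreachable; fail-open"}}
    }
    return &Decision{Allow: false, Reasons: []string{"policy engine unreachable; fail-closed"}}
}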

Bundle staleness

Policy bundles are cached locally on each pod. OPA continues evaluating with the last-synced bundle if the bundle server is unavailable. Stale-but-safe is structural — restrictions only ratchet toward more-restrictive across layers, so a stale bundle never grants permissions a fresh bundle would have denied. A staleness metric alerts on lag exceeding 60s (policies) or 30s (data).

Read-only bundles, validated overrides

Policy bundles are read-only at runtime — there is no API to mutate a loaded bundle. Tenants configure overrides via structured proto fields (TenantPolicyOverrides, ProjectPolicyOverrides), which the bundle builder validates and lifts into the data bundle. Tenant-submitted overrides cannot weaken Travila-level security policies — the four-layer merge invariant guarantees this regardless of what a tenant submits.

Performance targets

| Operation | Target |
|---|---|
| OPA policy evaluation (local sidecar) | < 1ms p95 |
| Policy bundle sync | < 30s after push |
| Decision log shipping | < 5s to Loki |
| End-to-end Firestore write → OPA has new config | ~10–30s |

Sidecar memory overhead is ~50MB per Restate pod. Explicit open question 1 weighs sidecar vs shared OPA service — current leaning is sidecar for the latency, with memory accepted as cheap.

Rollout Phases

The 6-phase plan from the PRD. Each phase is independent — domains migrate one at a time, no big-bang cutover.

| Phase | Scope | Status |
|---|---|---|
| 1. OPA Infrastructure + pkg/policy | Add OPA sidecar to Helm charts. Create pkg/policy/ client (Evaluate()). Configure GCS bundle server + dual-source sync (policies + data). Deploy bundle builder Restate service consuming platform.config.changes. Wire decision logging to Loki + Kafka audit topic. Deploy to staging with allow-all base policies. Unblocks Billing Phase 5 (real-time entitlement enforcement) — see Billing Roadmap. | Not started |
| 2. Model Access + Data Residency | Write policy/model_access Rego with EU-approved model list and project allowlist. Write policy/data_residency Rego (webhook region, LLM provider DPA). Integrate into generation workflow + webhook service in shadow mode. Add TenantPolicyOverrides.model_allowlist/denylist. Tests for every rule. Cut over from shadow → active per the three-step migration pattern. | Not started |
| 3. Feature Access + Plan Tier | Write policy/feature_access Rego with plan tier hierarchy + feature→tier map. Create plan tier overlay bundles (policies/overlays/{free,starter,growth,enterprise}/). Integrate into gateways and generation workflow. Per-toggle audit of ad-hoc feature gates to classify entitlement vs release flag; migrate only the entitlements (release flags route to Flipt instead). | Not started |
| 4. RBAC Migration | Write policy/rbac Rego with role → action mapping. Migrate role_resolver.go and scope_checker.go logic to OPA. Integrate into KrakenD plugin / gateway deriveRequestContext(). Shadow-mode validation against existing role→scope behaviour. Remove hardcoded role maps from auth service. | Not started |
| 5. Tool Execution + Memory | Write policy/tool_execution Rego (HIPAA approval, trusted MCP servers). Write policy/memory Rego with consent integration from GDPR Readiness PRD. Migrate partitionToolCalls() pattern matching and mem0Config.GetEnabled() to OPA. Add HIPAA mode support. | Not started |
| 6. Tenant Self-Service Policy Config | Add UpdateTenantPolicyOverrides / UpdateProjectPolicyOverrides / GetEffective* RPCs to InternalAdminGateway. Implement the four-layer merge with GetEffective* so admins can see actual policy in force. Bundle build pipeline includes tenant-specific overlays. Admin dashboard surface for policy configuration. | Not started |

Dependency ordering

Phase dependency graph — six phases plus an external Billing PRD dependency. Phase 1 (OPA Infrastructure + pkg/policy) is the central unlock: it ships the sidecar, GCS bundles, BundleBuilder service, and decision logging. Hard dependency arrows fan out from Phase 1 to Phase 2 (Model Access + Data Residency), Phase 3 (Feature Access + Plan Tier), Phase 4 (RBAC Migration), and Phase 5 (Tool Execution + Memory) — these four phases are independent of each other and can run in parallel. Phase 6 (Tenant Self-Service Config) depends on Phase 2 minimum so GetEffective* has at least one populated rule layer beyond platform base. A separate dashed dependency edge runs from Phase 1 to the external Billing PRD Phase 5 (Real-Time Enforcement — opa-consumer + gateway middleware), showing the cross-PRD blocker: Billing's real-time entitlement enforcement cannot ship until this PRD's Phase 1 lands.

| Phase depends on | Reason |
|---|---|
| Phase 2 depends on Phase 1 | OPA sidecar + pkg/policy + bundle pipeline must exist before any domain has somewhere to evaluate |
| Phases 3, 4, 5 depend on Phase 1 | Same — each is an additional domain on the substrate Phase 1 ships |
| Phase 6 depends on Phase 2 minimum | First domain-with-real-rules must be in production before tenant self-service can meaningfully expose overrides; the GetEffective* merge view assumes at least one populated layer beyond platform base |
| Billing Phase 5 depends on this PRD's Phase 1 | The opa-consumer and gateway middleware in the Billing Roadmap need OPA infrastructure (sidecar deployment, pkg/policy client, GCS bundle server, decision logging) — explicit cross-PRD blocker |
| Domains within Phases 2–5 migrate independently | Each enforcement point follows shadow → active → cleanup independently; one domain's regression does not block another |

Out of Scope for MVP

  • Tenant-authored Rego policies — security risk. Tenants configure via structured TenantPolicyOverrides / ProjectPolicyOverrides, not raw Rego. Custom Rego stays in policies/tenants/{tid}/ under platform engineering control.
  • Real-time policy evaluation for streaming tokens — per-token evaluation is too expensive. Policy evaluates at request start. AI Safety streaming controls are a separate concern.
  • A/B testing policies — future enhancement. Not needed for initial compliance/safety use cases. Release-engineering A/B is handled by Flipt anyway.
  • Cost-based policy (budget limits) — handled by Lago entitlements, not OPA. OPA reads the projected entitlement state via opa-consumer in billing Phase 5, but the source of truth is Lago.
  • Network-level policy (IP allowlists at LB) — infrastructure concern; IP allowlists enforced at KrakenD/LB, not OPA. OPA can read input.context.ip for decision logging but does not own ingress firewalling.
  • Release feature flags — canary rollouts, A/B variants, kill switches, time-bounded experiments. Owned by Flipt via OpenFeature per PR #1270. Misclassifying a release flag as an entitlement is the most common adoption error.

Open Questions

| # | Question | Owner | Status |
|---|---|---|---|
| 1 | OPA sidecar vs shared OPA service? | Infrastructure | Leaning sidecar — latency is critical, ~50MB per pod is cheap |
| 2 | Should OPA replace OpenFGA for RBAC or complement it? | Engineering | Leaning complement — OpenFGA for relationship-based (is user in org?), OPA for policy-based (can role do action?). Both coexist; the call site decides which to ask |
| 3 | How to handle policy versioning for rollback? | Engineering | Open — GCS object versioning on bundles vs git-tag-based bundle builds |
| 4 | Do AI Safety content-filtering rules (PR #1227) belong in OPA or a dedicated ML service? | Engineering | Open — OPA handles structured access policy cleanly, but PII detection / toxicity scoring may need a separate ML-based service that OPA calls rather than implementing in Rego |
| 5 | Should policy decisions be exposed via API response header (e.g., X-Policy-Decision)? | Engineering | Open — useful for debugging, but risks leaking policy internals to clients |

Cross-References