Policy Engine (OPA)
End-to-end view of how policy decisions — model access, feature entitlements, data residency, tool approval, memory consent, RBAC — become a first-class platform primitive evaluated by a single engine. This page consolidates the Tenant Policy Engine PRD (v0.2, Draft) and its dependencies into a single reference. The design is complete; the 6-phase rollout has not started.
This page is derived from the Tenant Policy Engine PRD and its closely related dependencies:
Primary (docs/prd/):
- Tenant Policy Engine — the v0.2 design: OPA sidecar topology, dual-source bundles (CI-built Rego + Kafka-synced data), six policy domains, per-domain fail-open/closed posture, four-layer override merge, six-phase incremental migration
Dependencies (docs/prd/multi-tenancy/):
- 02 · Projects — establishes (tenant_id, project_id) as the policy boundary. Rego rules key into data.projects[tid][pid] for every evaluation
- 03 · Tenant Onboarding — InternalAdminGateway is the surface for UpdateTenantPolicyOverrides / UpdateProjectPolicyOverrides RPCs
- 07 · Data Residency — the regional model the data_residency policy enforces
Related:
- GDPR Readiness PRD — consent tracking the memory policy reads (input.user.consent.memory_extraction)
- AI Safety PRD (PR #1227) — content moderation rules; open question whether they live in OPA or a dedicated ML service
- User-Aware Generation PRD — user attribution surfaced into input.user
- Billing, Metering & Invoicing Roadmap — billing Phase 5 (real-time entitlement enforcement) blocks on this PRD's Phase 1 for opa-consumer and gateway middleware
- GCP Eventarc → Kafka Pipeline Roadmap — sibling pipeline; the data-bundle sync reuses the same Pub/Sub → Kafka bridge
- Tenant & Project Lifecycle Roadmap — multi-tenant substrate; tenant/project Firestore docs are the source of truth the data bundle projects from
- Edge Idempotency Roadmap — typed-subject format (user:*, tenant_member:*, agent:*) used in the policy input.user.id and subject namespacing
The whole design refuses to invent per-domain policy services or scatter if tenant.tier == "x" checks across the codebase:
- One engine for every policy domain. Model access, feature entitlements, data residency, tool execution, memory consent, RBAC — six domains, one OPA sidecar, one pkg/policy.Evaluate(path, input) call. A gateway request that needs RBAC + model access + feature access is one composable decision, not three Redis lookups.
- Sidecar, not shared service. Each Restate pod gets its own OPA on localhost:8181. Sub-millisecond evaluation, no network hop, no shared-service blast radius. The trade-off is ~50MB memory per pod — cheap relative to the latency saved.
- Dual-source bundles. Rego rules ship from git via CI to GCS (policies.tar.gz, polled every 30–60s). Tenant/project config ships from Firestore via Kafka to GCS (data.tar.gz, polled every 5–10s). Logic and data have different sources, different cadences, and different change owners — but compose at evaluation time via data.tenants[input.tenant_id].
- Input is request-shaped, data is platform-shaped. Services pass per-request context (tenant_id, project_id, user.role, action, resource) in input. Long-lived tenant tier, region, plan, and feature flags live in data.*. Per-request payload stays ~200 bytes; Rego has the full platform + tenant + project config in memory.
- Restrictions stack additively across four layers. Platform base → plan-tier overlay → tenant override → project override. Denylists union across layers; allowlists pick the most-specific non-empty layer; restriction-direction-only scalars (HIPAA, retention) ratchet upward. A higher layer cannot unblock what a lower layer denied.
- Entitlements vs release flags is a hard line. Long-lived tier-driven decisions ("Free can't access voice", "EU can't use US-hosted models") belong in OPA. Time-bounded toggles a PM flips from a UI (new streaming UI at 10% canary, kill switch) belong in Flipt via OpenFeature. Misclassifying the second as the first is what bloated the current scattered enforcement.
Canonical reference: Tenant Policy Engine PRD. The PRD's "What Moves to OPA vs What Stays in Code" section catalogs every current enforcement file with its target domain and migration phase.
Glossary
| Term | Definition |
|---|---|
| OPA | Open Policy Agent — CNCF-graduated policy engine. Evaluates Rego programs against JSON input + data, returns structured decisions. Self-hosted as a per-pod sidecar in this design. |
| Rego | OPA's declarative policy language. Used for all rule logic in policies/*.rego. Composable, testable via opa test, lint-checked in CI. |
| Policy bundle | A .tar.gz containing compiled Rego + optional data. OPA pulls bundles from a remote source on a polling interval and reloads on change. |
| Data bundle | A separate .tar.gz containing tenant/project/platform config as JSON. Loaded into data.tenants[*], data.projects[*][*], data.models.*, etc. — referenced by Rego rules at eval time. |
| pkg/policy | Greenfield Go client library (new module). Thin HTTP wrapper around the OPA REST API. Single method: Evaluate(ctx, path, input) → (*Decision, error). Same pattern as pkg/cache, pkg/secrets. |
| Decision | { allow: bool, reasons: []string, obligations?: ... } — Rego returns structured output so callers can attach human-readable deny reasons and domain-specific obligations (e.g., retention_days, log_level). |
| Policy domain | A Rego package corresponding to one enforcement concern: policy/model_access, policy/feature_access, policy/data_residency, policy/tool_execution, policy/memory, policy/rbac. |
| Shadow mode | Migration phase 1: call OPA alongside existing code, log both decisions, alert on disagreement, but use the existing code's verdict. Catches edge cases before swapping authority. |
| Bundle builder | services/policybundlebuilder/v1/ — new Restate service. Kafka consumer on platform.config.changes. Reads affected tenant/project docs from Firestore, writes updated JSON to GCS data bundle path, debounces bursts. |
| __platform__ project | Reserved synthetic project ID emitted by the bundle builder for every tenant. Permissive defaults. Callers without a real project (admin tools, cron, legacy endpoints) pass project_id="__platform__" so Rego doesn't need if project guards. InternalAdminGateway rejects user creation of this id. |
| Fail-open / fail-closed | Per-domain config for what happens when OPA is unreachable. Safety-critical domains (model_access, data_residency, tool_execution, rbac) fail-closed. Availability-sensitive domains (feature_access, memory search) fail-open. |
| Entitlement | Long-lived tenant-tier / compliance / plan / region-driven capability. Lives in OPA. Example: "EU tenants can only use models with EU DPAs." |
| Release flag | Time-bounded product/engineering toggle (canary, A/B, kill switch). Lives in Flipt via OpenFeature per PR #1270, not OPA. |
| Override layer | One of four merge layers: platform base (policies/platform/), plan-tier overlay (policies/overlays/{tier}/), tenant override (TenantPolicyOverrides on TenantProfileState), project override (ProjectPolicyOverrides on ProjectProfileState). |
Service & Component Inventory
New Services
| Component | Purpose | PRD section |
|---|---|---|
| pkg/policy/ (pkg/policy/client.go, go.mod) | Thin Go HTTP client wrapping OPA's /v1/data/{path} REST endpoint. One Evaluate() method that returns a structured Decision. Cross-cutting concern shipped as a shared module. | FR-1.3 |
| services/policybundlebuilder/v1/ (BundleBuilder) | Restate service + Kafka consumer on platform.config.changes. Reads Firestore tenant/project docs, writes updated JSON to GCS data-bundle path, debounces bursts (5s window), emits the synthetic __platform__ project for every tenant. | FR-5C |
| OPA sidecar container (in every Restate pod) | Open-source openpolicyagent/opa image. Configured with two bundle sources (policies + data), local decision log shipper, listens on localhost:8181 only. | FR-1.1, FR-1.2, FR-5B.1 |
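FR-1.3 pins the client to a single Evaluate method. A minimal sketch of what that wrapper could look like against OPA's standard POST /v1/data/{path} REST contract; the Decision fields follow the glossary entry above, while the constructor and error handling are illustrative rather than the shipped implementation:

package policy

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
)

// Decision mirrors the structured result every Rego package returns.
type Decision struct {
    Allow       bool           `json:"allow"`
    Reasons     []string       `json:"reasons,omitempty"`
    Obligations map[string]any `json:"obligations,omitempty"`
}

// Client is a thin wrapper around the local OPA sidecar's REST API.
type Client struct {
    baseURL string // e.g. "http://localhost:8181"
    httpc   *http.Client
}

func NewClient(baseURL string) *Client {
    return &Client{baseURL: baseURL, httpc: &http.Client{}}
}

// Evaluate POSTs {"input": ...} to /v1/data/{path} and decodes {"result": ...}.
func (c *Client) Evaluate(ctx context.Context, path string, input map[string]any) (*Decision, error) {
    body, err := json.Marshal(map[string]any{"input": input})
    if err != nil {
        return nil, err
    }
    req, err := http.NewRequestWithContext(ctx, http.MethodPost,
        fmt.Sprintf("%s/v1/data/%s", c.baseURL, path), bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := c.httpc.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("opa returned status %d", resp.StatusCode)
    }

    var out struct {
        Result Decision `json:"result"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return nil, err
    }
    return &out.Result, nil
}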
New Proto Definitions
| Definition | Location | Notes |
|---|---|---|
| TenantPolicyOverrides | apis/platform/v1/services/tenant/types.proto | Replaces the stub PolicyOverrides { map<string, string> overrides = 1; } at TenantProfileState field 16. The stub has no production readers (only generated MCP wrappers / defaults code), so the rename is safe. Fields: model_allowlist, model_denylist, blocked_mcp_servers, require_tool_approval_all, hipaa_mode, feature_overrides (map<string, bool>), memory_enabled, phi_retention_years. |
| ProjectPolicyOverrides | apis/platform/v1/services/project/types.proto (greenfield) | Attached to the future ProjectProfileState. Fields: allowed_models, disabled_features, require_tool_approval, require_classification, allowed_classifications, memory_enabled, custom_retention_days. Until ProjectProfileState lands alongside the ProjectProvisioningWorkflow, project overrides are not persisted — Rego sees only the synthetic __platform__ project. |
| InternalAdminGateway RPCs | apis/platform/v1/services/internal-admin-gateway/ | UpdateTenantPolicyOverrides / GetEffectiveTenantPolicy / UpdateProjectPolicyOverrides / GetEffectiveProjectPolicy. |
Integrated Components (No Changes Required)
| Component | Role |
|---|---|
| GCS | Hosts both bundle objects: gs://platform-opa-bundles/platform/policies.tar.gz and .../data.tar.gz. ETag-driven If-None-Match lets OPA only download when content changed. |
| Firestore | Source of truth for tenant/project config. Firestore change stream emits to the existing Pub/Sub topic the bundle builder consumes via Kafka. |
| Pub/Sub → Kafka bridge | Reuses the same pipeline as the GCP Eventarc → Kafka Pipeline. Tenant/project doc changes route to platform.config.changes. |
| Loki / OTLP Collector | OPA decision logs ship via stdout → Promtail/OTLP → Loki for short-term queryability, plus a Kafka audit topic for long-term retention. See FR-6. |
| OpenFGA | Stays. Owns relationship-based authorization ("is user U a member of org O?", "is user U the owner of resource R?"). OPA owns policy-based authorization ("can role R do action A?"). Both engines coexist — see Appendix B in the PRD. |
| Unkey | Stays. Per-API-key rate limiting is not a policy decision and remains in Unkey. |
| Lago | Stays. Quota / metering / current_usage enforcement is billing data, not OPA. Real-time entitlement state derived from Lago lands in OPA via the billing roadmap's opa-consumer (Phase 5 there = Phase 1 here). |
Observability Additions
| Signal | Purpose |
|---|---|
| OPA decision log per evaluation | decision_id, structured input, result, reasons, timer_rego_query_eval_ns — Loki-shipped, queryable by tenant / user / action / time |
| Shadow-mode disagreement alert | During migration phase 1, every divergence between OPA and existing code logs and alerts — surfaces edge cases before swapping authority |
| Bundle sync staleness metric | Age of last-applied policy / data bundle per pod — alerts if sync lag exceeds 60s (policies) or 30s (data) |
| Bundle builder Kafka consumer lag | Detects backpressure if Firestore change rate exceeds debounce throughput |
| decision_id propagation | Each deny response includes the decision id for support-ticket forensics; tied to the Loki entry |
Bundle Architecture — Logic and Data on Different Pipes
OPA bundles carry two distinct kinds of content with different sources and different update cadences. Conflating them is what makes OPA deployments brittle in practice; splitting them is the whole architectural move.
| Content type | Source | Update cadence | Example |
|---|---|---|---|
| Rego rules (logic) | Git repo (policies/) → CI build → GCS policies.tar.gz | On merge to main (typically minutes) | "PHI classification requires retention" |
| Data files (config) | Firestore → Pub/Sub → Kafka → bundle builder → GCS data.tar.gz | Real-time (seconds, debounced) | "mayo-clinic PHI retention = 10 years" |
Repo layout
policies/
├── platform/                        # Base platform policies (all tenants)
│   ├── model_access.rego
│   ├── feature_access.rego
│   ├── data_residency.rego
│   ├── tool_execution.rego
│   ├── memory.rego
│   ├── rbac.rego
│   └── data.json                    # Shared static data (feature → tier map, plan hierarchy)
├── overlays/                        # Per-plan-tier overrides
│   ├── free/feature_access.rego
│   ├── enterprise/model_access.rego
│   └── hipaa/tool_execution.rego
├── tenants/                         # Per-tenant custom policies (rare)
│   └── bigbank/model_access.rego
└── tests/                           # opa test fixtures, one per package
    ├── model_access_test.rego
    └── ...
CI build pipeline
opa test policies/ -v # gate every PR
opa build -b policies/platform -b policies/overlays/${PLAN_TIER} -o policies.tar.gz
gsutil cp policies.tar.gz gs://platform-opa-bundles/platform/policies.tar.gz
Kafka → OPA data sync
┌──────────────────────┐   ┌───────────┐   ┌───────────────┐   ┌───────────┐   ┌───────────────┐
│ InternalAdminGateway │──▶│ Firestore │──▶│ Firestore CDC │──▶│ Pub/Sub → │──▶│ BundleBuilder │
│ / ProjectWorkflow    │   │           │   │ change stream │   │ Kafka     │   │ (Restate svc) │
└──────────────────────┘   └───────────┘   └───────────────┘   └───────────┘   └───────┬───────┘
                                                                                       │
                                                                            Push to GCS data.tar.gz
                                                                                       │
                                                                                       ▼
                                                                               ┌───────────────┐
                                                                               │ OPA sidecars  │
                                                                               │ poll every 5s │
                                                                               └───────────────┘
End-to-end latency: Firestore write → OPA has new config in ~10–30 seconds.
Why not direct Firestore → OPA
- OPA doesn't natively read from Firestore (only HTTP bundles or push).
- Kafka adds durability — if the bundle builder is briefly down, events are not lost.
- Same Pub/Sub → Kafka bridge already exists for GCS and other event sources, so this is a new consumer, not a new pipeline.
- The bundle builder can debounce bursts (5s window) into one GCS write, avoiding write-storm during bulk tenant operations.
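The 5-second debounce window is what keeps bulk tenant operations from becoming a GCS write storm. A simplified sketch of the pattern, with hypothetical types and helpers (ConfigChange, rebuildAndUpload) standing in for the real bundle builder internals:

package bundlebuilder

import (
    "context"
    "time"
)

// ConfigChange is a hypothetical shape for one platform.config.changes event.
type ConfigChange struct {
    TenantID  string
    ProjectID string
}

// rebuildAndUpload is a placeholder for the real work: re-read the affected
// Firestore docs and rewrite the GCS data bundle.
func rebuildAndUpload(ctx context.Context, tenants map[string]struct{}) { /* ... */ }

// debounceLoop collapses bursts of config-change events into at most one
// bundle rebuild per window (5s in the PRD) instead of one write per event.
func debounceLoop(ctx context.Context, events <-chan ConfigChange) {
    const window = 5 * time.Second
    pending := map[string]struct{}{}
    ticker := time.NewTicker(window)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case ev := <-events:
            pending[ev.TenantID] = struct{}{} // coalesce, no write yet
        case <-ticker.C:
            if len(pending) > 0 {
                rebuildAndUpload(ctx, pending)
                pending = map[string]struct{}{}
            }
        }
    }
}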
Fallback behaviour
If the Kafka pipeline is down, OPA continues with the last-synced data bundle. Stale-but-safe by construction: policies only become more restrictive with stale data, never less (denylists union, allowlists pick most-specific non-empty, scalars ratchet toward restriction). A reconciliation job periodically rebuilds the full data bundle from Firestore as a consistency check.
__platform__ synthetic project
For every tenant, the bundle builder emits a synthetic project __platform__ with permissive defaults. This guarantees data.projects[tid][pid] always resolves and removes the need for if project guards in every Rego rule. Callers without a real project (admin paths, cron jobs, legacy endpoints) pass project_id="__platform__".
{
  "allowed_models": [],
  "disabled_features": [],
  "require_tool_approval": false,
  "require_classification": false,
  "allowed_classifications": [],
  "memory_enabled": true,
  "custom_retention_days": 0
}
The __platform__ project name is reserved — InternalAdminGateway rejects creation with this id.
Policy Domains
Six domains, one Rego package each, one enforcement-point family each. Every domain's failure mode and migration phase is locked in the PRD.
| Domain | Rego package | What it decides | Replaces | Fail-on-OPA-down | Phase |
|---|---|---|---|---|---|
| Model access | policy/model_access | "Can this tenant/user use this model?" Region restrictions, plan-tier model lists, project allowlists | services/openrouter/v1/metadata_filter.go (access portion; metadata filters for capabilities stay) | Closed (deny) | 2 |
| Feature access | policy/feature_access | "Does this plan tier / project allow this feature?" rag, voice, multi_agent, memory, webhooks, custom_models, … gated by plan hierarchy | Tier-driven hardcoded checks (NOT release flags — see scope note) | Open (allow) | 3 |
| Data residency | policy/data_residency | "Does this action keep data in the tenant's region?" Webhook endpoint URL region check; LLM provider EU-DPA requirement | New capability — enables Data Residency PRD | Closed (deny) | 2 |
| Tool execution | policy/tool_execution | "Can this tool run, and does it need human approval?" HIPAA mode forces approval on all; specific tools require approval for non-admins; untrusted MCP servers blocked | partitionToolCalls() pattern matching in workflows/generation/v1/impl.go | Closed (deny) | 5 |
| Memory & consent | policy/memory | "Can we extract memories from this conversation? Can we search the user's memories?" Requires user consent for extraction; tenant/project toggle for search | mem0Config.GetEnabled() check in workflows/generation/v1/memory_helpers.go | Open for search, closed for extraction | 5 |
| RBAC | policy/rbac | "Can this role perform this action?" Role → action mapping (admin/owner full, developer scoped, viewer read-only) | Hardcoded role→scope map in services/auth/v1/role_resolver.go + scope_checker.go | Closed (deny) | 4 |
Per-domain Rego semantics
Two patterns matter across all six packages:
Deny wins explicitly. In a package with default allow := true, a bare allow if { count(deny) == 0 } rule is undefined whenever its body fails, so OPA falls back to the default — silently allowing the request despite a pending deny. The PRD locks the correct shape into every default-true package:
default allow := true
deny contains msg if { ... }
allow := false if {
    count(deny) > 0
}
Missing keys behave correctly under == comparisons. A search_allowed := false if tenant.memory_enabled == false rule does NOT match when tenant.memory_enabled is absent — the comparison is undefined, the rule body fails, and the default search_allowed := true stays. This is why scalar overrides in the PRD use explicit == false rather than != true.
Decision-tuple shape
Every domain returns allow: bool + optional reasons: [string]. Some domains add domain-specific fields the caller applies directly:
{
  "result": {
    "allow": false,
    "reasons": [
      "Model 'anthropic/claude-sonnet-4' is not in the EU-approved model list for tenant 'bigbank'",
      "Tenant region 'eu' requires models with EU data processing agreements"
    ],
    "obligations": {
      "log_level": "warn",
      "notify_admin": false
    }
  }
}
Storage upload returns retention_days, classification, access_default; tool execution returns require_approval: bool. These are documented in each domain's consuming PRD (e.g., Storage Model PRD §8.5 for storage obligations).
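Obligations stay an open map on the Decision; each caller pulls out only the fields its domain documents. An illustrative sketch of how a storage-upload caller might read them, using a hypothetical helper name and the field names from the examples in this section:

// applyStorageObligations pulls the storage-domain obligations out of a
// Decision (hypothetical helper; the real consumer lives in the upload path).
// JSON numbers arrive as float64 when decoded into map[string]any.
func applyStorageObligations(d *policy.Decision) (retentionDays int, classification string) {
    if v, ok := d.Obligations["retention_days"].(float64); ok {
        retentionDays = int(v)
    }
    if v, ok := d.Obligations["classification"].(string); ok {
        classification = v
    }
    return retentionDays, classification
}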
Input vs Data — What Goes Where
Services pass request-specific context in input. Tenant/project config lives in the data bundle.
| Field | Source | Rationale |
|---|---|---|
| tenant_id, project_id | Input (per-request) | Identifies which data to look up |
| user.id, user.role, user.scopes | Input (per-request) | Changes per request |
| action, resource.* | Input (per-request) | What the user is trying to do |
| context.ip, context.timestamp | Input (per-request) | Audit / forensics |
| Tenant plan_tier, data_region, hipaa_mode, phi_retention_years | Data bundle | Changes rarely (admin action), shared across requests |
| Project allowed_classifications, custom_retention_days, require_tool_approval | Data bundle | Changes rarely (admin action), shared across requests |
| data.models.eu_approved, data.models.eu_dpa_providers, feature→tier map, plan hierarchy | Data bundle (platform-static) | Platform config, changes on deploy or admin action |
Per-request input stays ~200 bytes while OPA has full platform + tenant + project config in memory. project_id is always required — callers without a real project pass "__platform__".
Four-Layer Override Merge
Merge semantics by field shape
| Field shape | Examples | Merge rule |
|---|---|---|
| Denylists | model_denylist, blocked_mcp_servers, disabled_features | Additive across layers — every layer can add. Net deny = union(platform, tier, tenant, project). |
| Allowlists | model_allowlist, allowed_models, allowed_classifications | Most-specific non-empty wins — a non-empty layer replaces lower layers. If project.allowed_models is non-empty it is used; else tenant.model_allowlist; else platform/tier defaults. |
| Booleans / scalars | hipaa_mode, require_tool_approval, memory_enabled, phi_retention_years | Most-specific-wins, restriction-direction only. A higher layer can flip toward more restrictive (memory_enabled: false, require_tool_approval: true, higher retention) but cannot relax a lower layer's restriction. |
The single invariant
A higher layer can never unblock what a lower layer denied. A project allowlist of ["A"] does NOT grant access to model "B" if tenant.model_denylist contains "B". Allowlists narrow the available set; they do not override denylists. This invariant is what makes tenant self-service safe — tenants can ratchet down, never up.
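The merge itself lives in Rego and the bundle builder, but the semantics are compact enough to state as code. An illustrative Go sketch of the three rules over a reduced field set (names and types here are not from the PRD):

// Layer holds the override fields one layer contributes, ordered from least
// specific (platform base) to most specific (project override).
type Layer struct {
    ModelDenylist  []string
    ModelAllowlist []string
    MemoryEnabled  *bool // nil = not set at this layer
}

// Effective applies the three merge rules across ordered layers.
func Effective(layers []Layer) (denylist, allowlist []string, memoryEnabled bool) {
    seen := map[string]bool{}
    memoryEnabled = true // permissive default; layers may only ratchet to false

    for _, l := range layers {
        // Denylists union: every layer can add, none can remove.
        for _, m := range l.ModelDenylist {
            if !seen[m] {
                seen[m] = true
                denylist = append(denylist, m)
            }
        }
        // Allowlists: the most specific non-empty layer replaces lower ones.
        if len(l.ModelAllowlist) > 0 {
            allowlist = l.ModelAllowlist
        }
        // Scalars ratchet toward restriction only; a later layer cannot re-enable.
        if l.MemoryEnabled != nil && !*l.MemoryEnabled {
            memoryEnabled = false
        }
    }
    return denylist, allowlist, memoryEnabled
}

// Allowed states the invariant: a model must clear the effective allowlist
// (if any) AND must not appear on the effective denylist. A more specific
// allowlist narrows the set but never overrides a denylist entry.
func Allowed(model string, allowlist, denylist []string) bool {
    for _, d := range denylist {
        if d == model {
            return false
        }
    }
    if len(allowlist) == 0 {
        return true // no allowlist at any layer: other rules decide
    }
    for _, a := range allowlist {
        if a == model {
            return true
        }
    }
    return false
}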
Enforcement Integration
Call shape
Every enforcement point calls pkg/policy.Evaluate(). The standard call site:
projectID := reqCtx.GetProject()
if projectID == "" {
    projectID = "__platform__" // FR-5C.4
}

decision, err := s.policyClient.Evaluate(ctx, "policy/model_access", map[string]any{
    "tenant_id":  reqCtx.GetTenant(),
    "project_id": projectID,
    "user":       map[string]any{"id": reqCtx.GetSubject(), "role": userRole},
    "action":     "llm.generate",
    "resource": map[string]any{
        "model":    effectiveConfig.GetModel(),
        "provider": resolveProvider(effectiveConfig.GetModel()),
    },
})
if err != nil {
    // Per-domain fail-open / fail-closed posture applies here (see Production
    // Hardening). model_access is fail-closed, so an unreachable OPA denies.
    slog.Warn("policy evaluation failed", "error", err, "domain", "model_access")
    return nil, sdkgo.TerminalError(fmt.Errorf("policy evaluation unavailable"), 503)
}
if !decision.Allow {
    return nil, sdkgo.TerminalError(
        fmt.Errorf("policy denied: %s", strings.Join(decision.Reasons, "; ")),
        403,
    )
}
Incremental migration — three phases per enforcement point
Each enforcement point migrates independently. No big-bang cutover.
Phase 1 — Shadow mode (1 week per domain):
- Add OPA call alongside existing code
- Log both decisions, alert on disagreement
- Existing code remains authoritative
Phase 2 — OPA active:
- OPA decision becomes authoritative
- Existing code remains as dead-code fallback (feature-flagged)
- Monitor for regressions
Phase 3 — Cleanup:
- Remove old if/else/switch logic
- Remove feature flag
- OPA is sole decision maker
Shadow-mode disagreements are how edge cases land in the test corpus before they land in production denials.
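A sketch of what shadow mode looks like at one call site; legacyModelAccessCheck and the surrounding service type are hypothetical stand-ins for the existing enforcement code, which stays authoritative throughout phase 1:

// evaluateShadow runs OPA alongside the legacy check without changing the
// request outcome. The legacy verdict (passed in as legacyAllow) still decides
// the request; OPA results only feed logging and the disagreement alert.
func (s *Service) evaluateShadow(ctx context.Context, input map[string]any, legacyAllow bool) {
    decision, err := s.policyClient.Evaluate(ctx, "policy/model_access", input)
    if err != nil {
        // Shadow failures never affect the request; they only count against readiness.
        slog.Warn("shadow policy evaluation failed", "domain", "model_access", "error", err)
        return
    }
    if decision.Allow != legacyAllow {
        // Disagreements feed the alert and the test corpus before authority is swapped.
        slog.Error("shadow policy disagreement",
            "domain", "model_access",
            "legacy_allow", legacyAllow,
            "opa_allow", decision.Allow,
            "reasons", decision.Reasons)
    }
}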
What Moves to OPA vs What Stays in Code
The line: policy decisions vs business logic
| Move to OPA | Keep in code |
|---|---|
| "Can tenant X use model Y?" | "How do I call OpenRouter with model Y?" |
| "Does project P require file classification?" | "How do I write a Firestore doc?" |
| "Is this tool approved for this user?" | "How do I execute this tool?" |
| "What retention period applies?" | "How do I set GCS Object Retention?" |
| "Is this feature available on this plan?" | "How do I render the feature gate error?" |
| "Which models are allowed in EU?" | "How do I parse the OpenRouter response?" |
Rule of thumb. If the answer could differ between two tenants or two projects with the same code deployed, it's a policy decision → OPA. If it's the same regardless of who's calling, it's business logic → stays in code.
Migration map (file → domain → phase)
| Current enforcement | File | OPA domain | Migration phase |
|---|---|---|---|
| Model metadata filtering (access portion) | services/openrouter/v1/metadata_filter.go | model_access | Phase 2 |
| exclude_moderated filter | services/openrouter/v1/metadata_filter.go | model_access | Phase 2 |
| HTTPS-only webhook enforcement | services/webhook/v1/tenant_helpers.go | data_residency | Phase 2 |
| mem0Config.GetEnabled() (if it's entitlement, not release flag) | workflows/generation/v1/memory_helpers.go | feature_access | Phase 3 |
| Role → scope mapping | services/auth/v1/role_resolver.go | rbac | Phase 4 |
| Scope checking | services/auth/v1/scope_checker.go | rbac | Phase 4 |
| Tool approval pattern matching | workflows/generation/v1/impl.go (partitionToolCalls) | tool_execution | Phase 5 |
| Memory extraction gate | workflows/generation/v1/memory_helpers.go | memory | Phase 5 |
Stays in code (and why)
| Concern | Why it stays |
|---|---|
| Request validation (buf.validate annotations) | Schema enforcement, same for all tenants |
| Restate retry/timeout config | Infrastructure concern, not tenant-specific |
| OpenFGA relationship checks ("is user member of project?") | Relationship-based auth, not policy |
| Error code mapping (gRPC/HTTP status) | Deterministic transformation |
| Kafka topic routing | Infrastructure concern |
| Proto serialization / deserialization | Mechanical transformation |
| GCS signed URL generation | Implementation detail after policy allows the download |
| Quota enforcement (Lago current_usage check) | Metering/billing — Lago authoritative, not OPA |
| Rate limiting | Unkey per-key limits + KrakenD — different mechanism, different cadence |
Entitlements vs release flags — the hard line
Two distinct categories of "this tenant can/can't use feature X" decisions belong in different systems:
| Concern | System | Examples |
|---|---|---|
| Feature entitlements — long-lived rules driven by tenant tier, compliance regime, region, plan, or platform security policy | OPA (this PRD) | "Free tier can't access voice", "EU tenants can't use US-hosted models", "HIPAA-mode tenants must use HIPAA-approved models", "Enterprise tier gets longer retention" |
| Release feature flags — time-bounded toggles driven by product rollout, canary, A/B test, or kill switch | Flipt via OpenFeature (PR #1270) | "New streaming UI at 10% canary", "A/B test variant B for MCP panel", "kill switch on the buggy memory extractor", "beta tester allowlist for experimental model X" |
Rule of thumb. If a PM wants to flip it from a UI without writing code or opening a policy PR, it's a release flag → Flipt. If a platform/security engineer would write it as a Rego rule that stays stable for months or years, it's an entitlement → OPA.
A per-toggle audit of ad-hoc gates like mem0Config.GetEnabled() is required before migration — if it's actually a product rollout rather than a tier-gated entitlement, it belongs in Flipt, not in the feature_access Rego policy.
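The two call sites end up looking deliberately different. A hedged sketch of the contrast, assuming the OpenFeature Go SDK on the Flipt side; the flag key, feature name, and evaluation-context fields are illustrative:

// Entitlement: long-lived, tier/region/compliance driven -> OPA via pkg/policy.
decision, err := policyClient.Evaluate(ctx, "policy/feature_access", map[string]any{
    "tenant_id":  tenantID,
    "project_id": projectID,
    "action":     "feature.use",
    "resource":   map[string]any{"feature": "voice"},
})

// Release flag: time-bounded rollout/canary/kill switch -> Flipt via OpenFeature.
ofClient := openfeature.NewClient("generation")
newStreamingUI, _ := ofClient.BooleanValue(ctx, "new-streaming-ui", false,
    openfeature.NewEvaluationContext(userID, map[string]any{"tenant_id": tenantID}))

// decision.Allow gates the entitlement; newStreamingUI gates the rollout.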
Tenant Policy Configuration
Tenants and projects configure their overrides via InternalAdminGateway RPCs. Tenants never author raw Rego — they fill structured proto fields, which the bundle builder lifts into JSON the Rego rules read via data.tenants[tid] / data.projects[tid][pid].
// apis/platform/v1/services/tenant/types.proto
message TenantPolicyOverrides {
  repeated string model_allowlist = 1;
  repeated string model_denylist = 2;
  repeated string blocked_mcp_servers = 3;
  bool require_tool_approval_all = 4;
  bool hipaa_mode = 5;
  map<string, bool> feature_overrides = 6;
  bool memory_enabled = 7;
  int32 phi_retention_years = 8;
}

message ProjectPolicyOverrides {
  repeated string allowed_models = 1;
  repeated string disabled_features = 2;
  bool require_tool_approval = 3;
  bool require_classification = 4;
  repeated string allowed_classifications = 5;
  bool memory_enabled = 6;
  int32 custom_retention_days = 7;
}

message TenantProfileState {
  // ... existing fields 1-15 ...
  TenantPolicyOverrides tenant_policy_overrides = 16; // was: PolicyOverrides policy_overrides
  // ... existing fields 90, 91 ...
}
ProjectProfileState does not yet exist — it lands alongside the ProjectProvisioningWorkflow and MUST carry ProjectPolicyOverrides project_policy_overrides. Until that proto ships, project overrides are not persisted; Rego sees only the synthetic __platform__ project for every request.
TenantPolicyOverrides migration note. The existing stub PolicyOverrides { map<string, string> overrides = 1; } at TenantProfileState field 16 has no production readers (only generated MCP wrappers and defaults code). The rename to TenantPolicyOverrides is safe — the Firestore field stays at position 16, only the type and field name change.
Edit Authority — Travila admin vs tenant admin
Edit authority is role-asymmetric: Travila admin authority is a strict superset of tenant admin authority. Travila admin can edit every configurable field; tenant admin can edit a defined self-serve subset.
| Layer | Edit authority |
|---|---|
| Platform base (policies/platform/*.rego) | Travila only — git + CI |
| Plan-tier overlay (policies/overlays/{tier}/) | Travila only — git + CI |
| Tenant overrides (TenantPolicyOverrides) | Travila admin: all fields. Tenant admin: self-serve subset (TBD) |
| Project overrides (ProjectPolicyOverrides) | Travila admin: all fields. Tenant admin: self-serve subset (TBD) |
Invariants regardless of role. Edit authority is which fields a role may write — it does not relax the four-layer merge. A Travila admin writing tenant.model_allowlist = ["X"] still cannot unblock something the platform base denies. All writes (Travila or tenant) traverse the same validate-then-persist pipeline.
Target surface split (final shape in a follow-up):
- ConsoleGateway (/admin/v1/*) — exposes the self-serve subset of Update*PolicyOverrides / GetEffective* to authenticated tenant admins.
- InternalAdminGateway (/internal/v1/*) — exposes the full field set to Travila staff, plus the ability to write/overwrite tenant-set fields for support and break-glass.
Deferred — what this roadmap does not commit to. The per-field role split (which fields are self-serve vs Travila-only) is not enumerated. TenantPolicyOverrides and ProjectPolicyOverrides are flat proto messages today; the split is a product/legal decision tracked as a follow-up. The MVP ships with full-set RPCs on InternalAdminGateway only; the ConsoleGateway subset lands once the per-field cut is decided.
Engineering posture while the cut is pending:
- The proto messages MUST NOT bake role authority into the schema — the same proto wires into both gateways with different field-level validation.
- Gateway implementations MUST NOT assume "tenant admin can write the whole message." Field-level authorization is a separate concern from the proto schema.
- Audit logs MUST capture which principal (Travila admin vs tenant admin) made each policy mutation, so post-hoc review can verify the eventual subset rule.
Gateway RPCs
rpc UpdateTenantPolicyOverrides(UpdateTenantPolicyOverridesRequest)
    returns (UpdateTenantPolicyOverridesResponse);
rpc GetEffectiveTenantPolicy(GetEffectiveTenantPolicyRequest)
    returns (GetEffectiveTenantPolicyResponse);
rpc UpdateProjectPolicyOverrides(UpdateProjectPolicyOverridesRequest)
    returns (UpdateProjectPolicyOverridesResponse);
rpc GetEffectiveProjectPolicy(GetEffectiveProjectPolicyRequest)
    returns (GetEffectiveProjectPolicyResponse);
GetEffective* returns the merged result of platform → tier → tenant → project layers so admins can see what is actually in force without reasoning about layering by hand.
Decision Logging & Audit
OPA sidecar → stdout (JSON decision logs)
→ Promtail / OTLP Collector
→ Loki (queryable, short retention)
→ Kafka audit topic (long-term retention)
Every evaluation produces a log entry:
{
  "decision_id": "abc-123",
  "timestamp": "2026-03-31T10:00:00Z",
  "path": "policy/model_access",
  "input": {
    "tenant_id": "bigbank",
    "project_id": "trading-prod",
    "user": {"id": "alice", "role": "developer"},
    "action": "llm.generate",
    "resource": {"model": "openai/gpt-4o"}
  },
  "result": {
    "allow": false,
    "reasons": ["Model 'openai/gpt-4o' not approved for EU tenants"]
  },
  "metrics": {
    "timer_rego_query_eval_ns": 142000
  }
}
100% of deny decisions are logged (success metric). Logs are queryable by tenant, user, action, time range — "show me all denied model access requests for tenant X in the last 24 hours" is a single Loki query. The full pipeline depends on the audit logging infrastructure from Audit Logging PRD (PR #537).
Production Hardening
Per-domain fail-open / fail-closed
OPA sidecar unavailability does not block requests beyond the per-domain configured behaviour:
| Domain | Default on OPA failure | Rationale |
|---|---|---|
| model_access | Fail closed | Safety / compliance critical — denying a request is better than serving from an unrestricted model list |
| feature_access | Fail open | Availability-sensitive — paid users shouldn't lose access during an OPA outage |
| data_residency | Fail closed | Compliance critical |
| tool_execution | Fail closed | Safety critical — tools may have side effects |
| memory (search) | Fail open | Worst case of failing open is an empty search result |
| memory (extraction) | Fail closed | Consent gate — never extract without affirmative consent |
| rbac | Fail closed | Security critical |
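Because the posture is per-domain configuration rather than per-call judgment, the handling can live in one wrapper around pkg/policy. An illustrative Go sketch (the map mirrors the table above; splitting memory search and extraction into separate policy paths is an assumption, not something the PRD specifies):

// failOpen records which domains may proceed when OPA is unreachable.
var failOpen = map[string]bool{
    "policy/model_access":      false,
    "policy/feature_access":    true,
    "policy/data_residency":    false,
    "policy/tool_execution":    false,
    "policy/memory/search":     true,  // assumed sub-path for the search decision
    "policy/memory/extraction": false, // assumed sub-path for the extraction decision
    "policy/rbac":              false,
}

// evaluateWithPosture wraps Evaluate with the per-domain failure behaviour:
// fail-open domains get an allow-all decision on error, fail-closed domains
// get a deny carrying the outage as the reason.
func evaluateWithPosture(ctx context.Context, c *policy.Client, path string, input map[string]any) (*policy.Decision, error) {
    decision, err := c.Evaluate(ctx, path, input)
    if err == nil {
        return decision, nil
    }
    slog.Warn("policy evaluation failed", "domain", path, "error", err)
    if failOpen[path] {
        return &policy.Decision{Allow: true, Reasons: []string{"fail-open: OPA unreachable"}}, nil
    }
    return &policy.Decision{Allow: false, Reasons: []string{"fail-closed: OPA unreachable"}}, nil
}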
Bundle staleness
Policy bundles are cached locally on each pod. OPA continues evaluating with the last-synced bundle if the bundle server is unavailable. Stale-but-safe is structural — restrictions only ratchet toward more-restrictive across layers, so a stale bundle never grants permissions a fresh bundle would have denied. A staleness metric alerts on lag exceeding 60s (policies) or 30s (data).
Read-only bundles, validated overrides
Policy bundles are read-only at runtime — there is no API to mutate a loaded bundle. Tenants configure overrides via structured proto fields (TenantPolicyOverrides, ProjectPolicyOverrides), which the bundle builder validates and lifts into the data bundle. Tenant-submitted overrides cannot weaken Travila-level security policies — the four-layer merge invariant guarantees this regardless of what a tenant submits.
Performance targets
| Operation | Target |
|---|---|
| OPA policy evaluation (local sidecar) | < 1ms p95 |
| Policy bundle sync | < 30s after push |
| Decision log shipping | < 5s to Loki |
| End-to-end Firestore write → OPA has new config | ~10–30s |
Sidecar memory overhead is ~50MB per Restate pod. Explicit open question 1 weighs sidecar vs shared OPA service — current leaning is sidecar for the latency, with memory accepted as cheap.
Rollout Phases
The 6-phase plan from the PRD. Each phase is independent — domains migrate one at a time, no big-bang cutover.
| Phase | Scope | Status |
|---|---|---|
| 1. OPA Infrastructure + pkg/policy | Add OPA sidecar to Helm charts. Create pkg/policy/ client (Evaluate()). Configure GCS bundle server + dual-source sync (policies + data). Deploy bundle builder Restate service consuming platform.config.changes. Wire decision logging to Loki + Kafka audit topic. Deploy to staging with allow-all base policies. Unblocks Billing Phase 5 (real-time entitlement enforcement) — see Billing Roadmap. | Not started |
| 2. Model Access + Data Residency | Write policy/model_access Rego with EU-approved model list and project allowlist. Write policy/data_residency Rego (webhook region, LLM provider DPA). Integrate into generation workflow + webhook service in shadow mode. Add TenantPolicyOverrides.model_allowlist/denylist. Tests for every rule. Cut over from shadow → active per the three-step migration pattern. | Not started |
| 3. Feature Access + Plan Tier | Write policy/feature_access Rego with plan tier hierarchy + feature→tier map. Create plan tier overlay bundles (policies/overlays/{free,starter,growth,enterprise}/). Integrate into gateways and generation workflow. Per-toggle audit of ad-hoc feature gates to classify entitlement vs release flag; migrate only the entitlements (release flags route to Flipt instead). | Not started |
| 4. RBAC Migration | Write policy/rbac Rego with role → action mapping. Migrate role_resolver.go and scope_checker.go logic to OPA. Integrate into KrakenD plugin / gateway deriveRequestContext(). Shadow-mode validation against existing role→scope behaviour. Remove hardcoded role maps from auth service. | Not started |
| 5. Tool Execution + Memory | Write policy/tool_execution Rego (HIPAA approval, trusted MCP servers). Write policy/memory Rego with consent integration from GDPR Readiness PRD. Migrate partitionToolCalls() pattern matching and mem0Config.GetEnabled() to OPA. Add HIPAA mode support. | Not started |
| 6. Tenant Self-Service Policy Config | Add UpdateTenantPolicyOverrides / UpdateProjectPolicyOverrides / GetEffective* RPCs to InternalAdminGateway. Implement the four-layer merge with GetEffective* so admins can see actual policy in force. Bundle build pipeline includes tenant-specific overlays. Admin dashboard surface for policy configuration. | Not started |
Dependency ordering
| Phase depends on | Reason |
|---|---|
| Phase 2 depends on Phase 1 | OPA sidecar + pkg/policy + bundle pipeline must exist before any domain has somewhere to evaluate |
| Phases 3, 4, 5 depend on Phase 1 | Same — each is an additional domain on the substrate Phase 1 ships |
| Phase 6 depends on Phase 2 minimum | First domain-with-real-rules must be in production before tenant self-service can meaningfully expose overrides; the GetEffective* merge view assumes at least one populated layer beyond platform base |
| Billing Phase 5 depends on this PRD's Phase 1 | The opa-consumer and gateway middleware in the Billing Roadmap need OPA infrastructure (sidecar deployment, pkg/policy client, GCS bundle server, decision logging) — explicit cross-PRD blocker |
| Domains within Phases 2–5 migrate independently | Each enforcement point follows shadow → active → cleanup independently; one domain's regression does not block another |
Out of Scope for MVP
- Tenant-authored Rego policies — security risk. Tenants configure via structured TenantPolicyOverrides / ProjectPolicyOverrides, not raw Rego. Custom Rego stays in policies/tenants/{tid}/ under platform engineering control.
- Real-time policy evaluation for streaming tokens — per-token evaluation is too expensive. Policy evaluates at request start. AI Safety streaming controls are a separate concern.
- A/B testing policies — future enhancement. Not needed for initial compliance/safety use cases. Release-engineering A/B is handled by Flipt anyway.
- Cost-based policy (budget limits) — handled by Lago entitlements, not OPA. OPA reads the projected entitlement state via opa-consumer in billing Phase 5, but the source of truth is Lago.
- Network-level policy (IP allowlists at LB) — infrastructure concern; IP allowlists enforced at KrakenD/LB, not OPA. OPA can read input.context.ip for decision logging but does not own ingress firewalling.
- Release feature flags — canary rollouts, A/B variants, kill switches, time-bounded experiments. Owned by Flipt via OpenFeature per PR #1270. Misclassifying a release flag as an entitlement is the most common adoption error.
Open Questions
| # | Question | Owner | Status |
|---|---|---|---|
| 1 | OPA sidecar vs shared OPA service? | Infrastructure | Leaning sidecar — latency is critical, ~50MB per pod is cheap |
| 2 | Should OPA replace OpenFGA for RBAC or complement it? | Engineering | Leaning complement — OpenFGA for relationship-based (is user in org?), OPA for policy-based (can role do action?). Both coexist; the call site decides which to ask |
| 3 | How to handle policy versioning for rollback? | Engineering | Open — GCS object versioning on bundles vs git-tag-based bundle builds |
| 4 | Do AI Safety content-filtering rules (PR #1227) belong in OPA or a dedicated ML service? | Engineering | Open — OPA handles structured access policy cleanly, but PII detection / toxicity scoring may need a separate ML-based service that OPA calls rather than implementing in Rego |
| 5 | Should policy decisions be exposed via API response header (e.g., X-Policy-Decision)? | Engineering | Open — useful for debugging, but risks leaking policy internals to clients |
Cross-References
- Tenant Policy Engine PRD — the v0.2 design in full
- Billing, Metering & Invoicing Roadmap — billing Phase 5 (real-time enforcement) blocks on this PRD's Phase 1
- Tenant & Project Lifecycle Roadmap — multi-tenant substrate; tenant/project Firestore docs are the source of truth the data bundle projects from
- GCP Eventarc → Kafka Pipeline Roadmap — sibling pipeline; data-bundle sync reuses the same Pub/Sub → Kafka bridge
- Edge Idempotency Roadmap — typed-subject format used in input.user.id
- Tenant Onboarding PRD — InternalAdminGateway surface for override RPCs
- Data Residency PRD — the regional model data_residency enforces
- GDPR Readiness PRD — consent tracking the memory policy reads
- AI Safety PRD (PR #1227) — content moderation; open question whether it lives in OPA
- Feature Flags PRD (PR #1270) — Flipt / OpenFeature for release flags; the explicit complement to OPA's entitlement scope
- Audit Logging PRD (PR #537) — decision-log pipeline this design ships into
- Open Policy Agent docs — upstream reference
- Rego language reference — for policy authors