RequestContext Refactor Roadmap
End-to-end view of how identity, scope, and routing become three separate concerns instead of one overloaded caller_key string. This page consolidates the RequestContext Refactor PRD (v0.5, Draft) and its dependencies into a single reference. The design is complete; the 9-phase rollout has not started, and this work is sequenced before the Tenant & Project Lifecycle and Edge Idempotency rollouts because every gateway, actor, workflow, and Kafka consumer downstream of those plans relies on the typed-subject and protovalidate guarantees this refactor establishes.
This page is derived from the RequestContext Refactor PRD and its closely related dependencies:
Primary (docs/prd/):
- RequestContext Refactor — the v0.5 design, three-concern split, EventContext identity gap, protovalidate enforcement, 9-phase migration
Dependencies (docs/prd/multi-tenancy/):
- 02 · Projects — adds RequestContext.project (field 8) and EventContext.project_id (field 12); routing keys move to 3-part tenant:project:resource constructed at SDK call sites, never carried as RequestContext fields
- 04 · Identity — Zitadel issues the tenant_member:* typed subjects this refactor expects
- 05 · Authorization — defines the typed-subject format (user:*, agent:*, tenant_member:*) and is the blocker for Phase 4 (storage cannot drop the CallerKey-override hack until it can authorize agent:conv-abc via OpenFGA)
- 06 · API Keys — OQ #3 adds meta.tenant_member_id to every Unkey key so admin RPCs always carry a tenant_member:* subject
Related:
- FILE_ID Resolution — runs in parallel; storage authorization changes align with the typed-subject model
- MCP Tool Annotations — the MCP service is the only service that today validates tenant context manually (becomes obsolete after Phase 0)
- Tenant & Project Lifecycle Roadmap — ships on top of this refactor's typed-subject and protovalidate guarantees
- Edge Idempotency Roadmap — sibling roadmap; the namespace rewrite uses the same typed-subject format this PRD establishes
The refactor implements a pattern that every major platform operating at scale has independently converged on: construct a canonical, structured identity context once at the edge, propagate it through every RPC, validate it at every hop, and never let context-less requests reach a handler.
- Netflix Passport — protobuf-encoded identity built at the Zuul gateway, integrity-protected, propagated to every backend. Replaced O(n) per-service token parsing with O(1) edge construction. Our RequestContext proto serves the same role.
- Google BeyondProd — short-lived End-User Context (EUC) tickets validated at every hop alongside service-to-service mTLS. Missing or invalid context is a hard error — no fallback. Our [(buf.validate.field).required = true] on every ctx field is the same posture.
- Google Zanzibar / SpiceDB / OpenFGA — typed subjects (user:alice not bare alice) prevent ID collisions across entity types and let the authorization layer validate at tuple-write time, not check time. Our subject = "type:id" convention maps directly.
- CloudEvents subject attribute — the spec explicitly recommends: "If identity attributes happen to be part of the event data, the event producer SHOULD also add them to context attributes" — so routing layers can inspect identity without deserializing the payload. Our EventContext.subject (NEW field 7) closes the same async audit gap.
- OWASP Multi-Tenant Cheat Sheet / NIST SP 800-207 — both mandate "establish tenant context early, derive from verified claims, reject requests lacking authenticated context." Today our protos have zero buf.validate annotations on context messages — that gap is what Phase 0 closes.
Canonical reference: RequestContext Refactor PRD. The PRD's Production Validation section catalogs 30+ citations spanning Netflix, Google, Uber, Shopify, Confluent, Segment, DoorDash, Lyft, Temporal, Restate, Azure, AWS, OWASP, NIST, and W3C.
Glossary
| Term | Definition |
|---|---|
| RequestContext | The synchronous-RPC envelope. Proto field ctx on every request message. Carries identity, scope, tracing. Defined in apis/common/v1/context.proto. |
| EventContext | The asynchronous-Kafka envelope. Wraps every domain event with identity, correlation, and tenant scope. Built via pkg/restatex/event_context.go. |
| subject | Typed Zanzibar principal — type:id. Three permitted types: user:* (end user), tenant_member:* (Zitadel-authenticated team member), agent:* (AI agent — typically agent:{conversation_id}). Validated min_len = 1. |
| on_behalf_of | NEW field. Delegation chain. The original principal when subject is acting transitively (e.g., agent acting on a user's request → subject = "agent:conv-abc", on_behalf_of = "user:alice"). Empty for direct actions. |
| session_id | NEW, optional. Client session identifier for cost attribution and analytics. From x-session-id header or derived from Firebase auth_time. Empty for server-to-server calls. |
| tenant | Hard tenant boundary. Every request has exactly one. Validated min_len = 1. (Field name is tenant_id on EventContext.) |
| trace_id | OTel trace ID for infrastructure observability, extracted from W3C traceparent at the gateway. NOT the business correlation ID. Lifetime: ms–seconds. Sampled. New trace on Kafka consume. |
| correlation_id | Business correlation, on EventContext only. MUST always be conversation_id, set explicitly via WithCorrelationID(). Lifetime: days–weeks. Survives Kafka. Never sampled. |
| caller_key | DEPRECATED. The single overloaded string this refactor replaces — meant different things at different layers (actor-routing key / conversation ID / user ID / Kafka partition key / tenant ID). |
| Typed Zanzibar subject | type:id convention from Google Zanzibar. The namespace prefix prevents ID collisions across entity types and is what OpenFGA's [type, type#relation] syntax validates at tuple-write time. |
| Protovalidate | The buf.validate plugin enforced inside every auto-generated *_restate_wrappers.go via w.Validator.Validate(req). The runtime is wired today; the rules don't exist yet. |
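To make the type:id convention concrete, here is a minimal Go sketch of parsing and checking a typed subject. ParseSubject and the Subject struct are hypothetical names introduced for illustration; only the type:id shape and the three permitted types come from the glossary above.

```go
package main

import (
	"fmt"
	"strings"
)

// Subject is a hypothetical parsed form of a typed "type:id" principal.
type Subject struct {
	Type string
	ID   string
}

// permittedTypes lists the three subject types named in the glossary.
var permittedTypes = map[string]bool{
	"user":          true,
	"tenant_member": true,
	"agent":         true,
}

// ParseSubject splits on the first colon and rejects bare IDs,
// empty IDs, and types outside the permitted set.
func ParseSubject(s string) (Subject, error) {
	typ, id, ok := strings.Cut(s, ":")
	if !ok || id == "" {
		return Subject{}, fmt.Errorf("subject %q is not of the form type:id", s)
	}
	if !permittedTypes[typ] {
		return Subject{}, fmt.Errorf("subject %q has unknown type %q", s, typ)
	}
	return Subject{Type: typ, ID: id}, nil
}

func main() {
	for _, raw := range []string{"user:alice", "agent:conv-abc", "alice", "service_account:x"} {
		sub, err := ParseSubject(raw)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("type=%s id=%s\n", sub.Type, sub.ID)
	}
}
```

Note that the proto rule in this roadmap is only min_len = 1; a stricter shape check like this one is an assumption about where such validation could live, not something the PRD mandates.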
The CallerKey Overload
RequestContext.caller_key is a single string field that means five different things depending on which layer reads it:
| Layer | caller_key value | What it actually means |
|---|---|---|
| Gateway → Actor | tenant:userId | Restate actor routing key + storage partitioning |
| Conversation → Workflow | conversationId | Which conversation triggered the workflow (for event correlation) |
| Workflow → Storage | userId (re-overridden) | Who owns the files being accessed |
| Workflow → Kafka | conversationId | Kafka partition key for event ordering |
| Admin operations | tenantId alone | Tenant-level scope, no user dimension |
The override chain in production today runs from actors/conversation/v1/impl.go:413-425 to workflows/generation/v1/impl.go:1032-1036.
Each override is a symptom of one field carrying multiple concerns. Downstream services cannot tell what caller_key is without knowing which layer set it. The authorization PRD's typed-subject model has nowhere clean to live. And protovalidate has no annotations, so requests with an empty tenant or no context at all pass silently — the only manual check is if tenant == "" inside the MCP service.
Target State: One Field, One Concern
The refactor splits caller_key into purpose-specific fields. Each field carries exactly one concern at every layer.
message RequestContext {
// ─── Identity — WHO is acting ──────────────────────────────────
string subject = 1 [(buf.validate.field).string.min_len = 1]; // typed: "user:alice"
string on_behalf_of = 6; // NEW — delegation chain
// ─── Scope — WHERE the action is authorized ────────────────────
string tenant = 2 [(buf.validate.field).string.min_len = 1];
// ─── Tracing — HOW to correlate logs/traces ────────────────────
string trace_id = 3; // OTel, infra-only
// ─── Session — for cost attribution ────────────────────────────
string session_id = 7; // NEW — optional
// ─── Existing ──────────────────────────────────────────────────
map<string, string> metadata = 4;
// ─── Deprecated ────────────────────────────────────────────────
string caller_key = 5 [deprecated = true];
}
message EventContext {
// ─── Event identity ────────────────────────────────────────────
string event_name = 1 [(buf.validate.field).string.min_len = 1];
string version = 2 [(buf.validate.field).string.min_len = 1];
string event_id = 3 [(buf.validate.field).string.min_len = 1];
// ─── Correlation — ALWAYS conversation_id ──────────────────────
string correlation_id = 4 [(buf.validate.field).string.min_len = 1];
string emitted_at = 5 [(buf.validate.field).string.min_len = 1];
// ─── Identity (NEW) ────────────────────────────────────────────
string subject = 7; // NEW — flows from RequestContext
string on_behalf_of = 8; // NEW — delegation flows
string session_id = 9; // NEW — for metering
// ─── Scope ─────────────────────────────────────────────────────
string tenant_id = 11 [(buf.validate.field).string.min_len = 1];
// ─── Existing / Deprecated ─────────────────────────────────────
map<string, string> metadata = 10;
string caller_key = 6 [deprecated = true];
}
Routing keys move to SDK call sites. Restate routing was already passed at client construction — the refactor stops smuggling it through RequestContext:
// Routing key belongs in the SDK call, NOT in RequestContext
conv := convpb.NewConversationActorServiceClient(ctx, conversationKey)
storage := storagepb.NewStorageManagerActorServiceClient(ctx, storageKey)
The ID Hierarchy
The system has seven distinct identifiers serving different purposes at different scopes. Conflating any two of them was the root cause of the override chain.
Tenant ("socayo") ← tenant: hard boundary, every request
└─ User ("alice") ← subject: typed identity (user:alice)
└─ Session (app open → close) ← session_id: NEW, optional
└─ Conversation ("conv-abc") ← correlation_id: primary business correlation
└─ Run ("run-xyz") ← run_id: one workflow execution (top-level event field)
└─ Tool Call ← tool_call_id: one MCP invocation
└─ Request ← trace_id: OTel, one HTTP request fan-out
trace_id vs correlation_id — They Are Not the Same
| | trace_id (OTel) | correlation_id (Business) |
|---|---|---|
| Scope | One HTTP request fan-out | Entire conversation (days/weeks) |
| Lifetime | Milliseconds to seconds | Days to weeks |
| Survives Kafka? | No — new trace on consume | Yes — carried in event payload |
| Sampled? | Yes (head/tail sampling) | Never |
| Standard | W3C traceparent header | App-defined |
| Purpose | "Why was this request slow?" | "Show me everything for conversation X" |
Current bug — correlation_id inconsistency. Nobody passes WithCorrelationID() to restatex.Publish() today. The fallback in pkg/restatex/kafka.go uses restate.Key(ctx), which returns run_id from a workflow and conversation_id from an actor — events from the same conversation end up with different correlation_id values. Phase 5 makes WithCorrelationID(conversationId) explicit at every call site and considers removing the fallback entirely (OQ #9).
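The "explicit, no fallback" behavior discussed for OQ #9 can be sketched with Go functional options. This is an illustration under assumptions: the option plumbing (publishOptions, PublishOption, resolveCorrelationID) is hypothetical, and the real signatures of WithCorrelationID() and restatex.Publish() are not shown in this page.

```go
package main

import (
	"errors"
	"fmt"
)

// publishOptions is a hypothetical container for per-publish settings.
type publishOptions struct {
	correlationID string
}

// PublishOption is a hypothetical functional-option type.
type PublishOption func(*publishOptions)

// WithCorrelationID mirrors the option named in the text; its real
// signature is assumed here.
func WithCorrelationID(id string) PublishOption {
	return func(o *publishOptions) { o.correlationID = id }
}

// ErrMissingCorrelationID is returned when no explicit ID was supplied.
var ErrMissingCorrelationID = errors.New("correlation_id must be set explicitly via WithCorrelationID")

// resolveCorrelationID applies the options and refuses to guess: there is
// no fallback to a routing key, so a missing ID surfaces as an error.
func resolveCorrelationID(opts ...PublishOption) (string, error) {
	var o publishOptions
	for _, apply := range opts {
		apply(&o)
	}
	if o.correlationID == "" {
		return "", ErrMissingCorrelationID
	}
	return o.correlationID, nil
}

func main() {
	id, err := resolveCorrelationID(WithCorrelationID("conv-abc"))
	fmt.Println(id, err)

	_, err = resolveCorrelationID()
	fmt.Println(err)
}
```

With the fallback removed, a call site that forgets the option fails at publish time instead of silently emitting a run_id as the correlation ID.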
Component Inventory
Touched by the Refactor
| Component | What changes | PRD section |
|---|---|---|
| apis/common/v1/context.proto | Adds on_behalf_of (RC field 6), session_id (RC field 7); adds subject, on_behalf_of, session_id to EventContext (fields 7–9); adds buf.validate rules; marks caller_key deprecated | §Target State, §Phase 0–1 |
| All *Request messages | Add [(buf.validate.field).required = true] on ctx; standardize on ctx = 1 (some gateways carry ctx = 100 today) | §Phase 0, §Phase 7 |
| pkg/tenancy / new pkg/requestcontext | Centralizes deriveRequestContext() — currently copy-pasted across 6+ gateways — into a single FromHeaders(headers) factory | §Phase 2 |
| 6 gateways (gateway, storage-gateway, notification-gateway, webhook-gateway, integrations-gateway, apikey-gateway) | Set typed subjects (subject = "user:{id}" or "tenant_member:{id}"); call the centralized factory; ctx = 100 → ctx = 1 | §Phase 2, §Phase 7 |
| actors/conversation/v1/impl.go | Replaces the CallerKey = conversationId override with subject = "agent:{conversationId}" + on_behalf_of = original subject | §Phase 3 |
| workflows/generation/v1/impl.go | Removes the storage CallerKey-override hack (line 1032); derives Kafka partition key from subject instead of caller_key; reads identity from subject for event correlation | §Phase 4, §Phase 5 |
| pkg/restatex/event_context.go | BuildEventContext() propagates subject + on_behalf_of + session_id from RequestContext; drops callerKey parameter | §Phase 5 |
| actors/firebasebridge/v1/impl.go | Extracts conversation ID from EventContext.subject instead of caller_key | §Phase 6 |
Existing Surfaces with Missing or Inconsistent Context
| Proto file | Gap | Severity |
|---|---|---|
| apis/auth/v1/services/openfga/openfga.proto | All 13 RPCs missing ctx | Critical — authorization service has no caller scope |
| apis/notification/v1/services/gateway/gateway.proto | All 55+ RPCs missing ctx | Critical — notification ops have no tenant scope |
| apis/github/v1/github.proto | All RPCs missing ctx | High |
| apis/rstt/v1/services/admin/admin.proto, introspection.proto, deployment.proto | Missing ctx on Restate-admin paths | High |
| apis/llm/v1/workflows/generation/generation.proto | GetStateRequest missing ctx | Medium |
| apis/pipecat/v1/services/cerebrium/cerebrium.proto | InvokeRequest missing ctx | Medium |
| All gateway protos | ctx = 100 instead of ctx = 1 (notification inner services use field name context, not ctx) | Pre-user; mechanical fix |
| services/gateway/v1/impl.go:387 | Real bug: ListPendingApprovals derives reqCtx but doesn't pass it on the downstream call | Independent of refactor; fix in passing |
Integrated (No Changes Required)
| Component | Role |
|---|---|
| buf.build/bufbuild/protovalidate | Already imported across 20+ services. Wrappers already call Validator.Validate(req) — Phase 0 just adds the rules to enforce. |
| Zitadel | Issues the tenant_member:* typed subjects this refactor consumes (no changes — Identity PRD already accepted). |
| buf.build/bufbuild/protovalidate-go | Runtime validator, no changes. |
| Restate Go SDK | Routing keys flow through SDK constructors, where they belong. |
Flow 1: External Request — Typed Subject from the Edge
The canonical happy path. KrakenD authenticates, derives the typed subject, and the same RequestContext flows unmodified through every hop.
Client POST /api/v1/llm/gateway/send-message
Authorization: Bearer <Zitadel JWT> (or pk_live_xxx)
x-session-id: <client UUID> (optional)
▼
KrakenD auth plugin
Validates token → injects x-tenant-id, x-user-id, x-session-id
(For sk_/pk_ paths: also resolves meta.tenant_member_id from Unkey)
▼
LLMGateway.SendMessage
pkg/tenancy.FromHeaders(headers) builds RequestContext:
subject = "user:firebase_uid_123" ← typed
tenant = "socayo" ← validated min_len=1
on_behalf_of = ""
session_id = "<client UUID or auth_time>"
trace_id = <from W3C traceparent>
Routes to ConversationActor via SDK:
convpb.NewConversationActorServiceClient(ctx, "socayo:conv-abc")
▼
ConversationActor → MCPService → … → Workflow → Storage
Same RequestContext at every hop. Each wrapper calls
Validator.Validate(req) before invoking the handler — empty
tenant or empty subject = TerminalError(400, "ctx.tenant must be at least 1 character").
Why at the gateway, not the handler. The MCP service is the only service today that manually enforces tenant != "". Every other service trusts whatever it gets — a bug that violates OWASP's "establish tenant context early" mandate. Phase 0 makes the check automatic at every wrapper, not optional in one service.
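The gateway step in Flow 1 can be sketched as a single header-to-context factory. This is a simplified illustration, not the planned pkg/tenancy.FromHeaders implementation: RequestContext is a plain-struct stand-in for the generated proto type, the subjectType parameter is hypothetical (how the real factory chooses between user and tenant_member is not specified here), and only version-00 traceparent values are handled.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// RequestContext is a plain-struct stand-in for the generated proto type.
type RequestContext struct {
	Subject   string
	Tenant    string
	SessionID string
	TraceID   string
}

// FromHeaders builds a RequestContext from the headers KrakenD injects.
// subjectType ("user" or "tenant_member") is a hypothetical parameter.
func FromHeaders(h http.Header, subjectType string) (*RequestContext, error) {
	tenant := h.Get("x-tenant-id")
	userID := h.Get("x-user-id")
	if tenant == "" || userID == "" {
		return nil, errors.New("missing x-tenant-id or x-user-id")
	}
	return &RequestContext{
		Subject:   subjectType + ":" + userID, // typed subject, e.g. "user:alice"
		Tenant:    tenant,
		SessionID: h.Get("x-session-id"), // optional, may be empty
		TraceID:   traceIDFromTraceparent(h.Get("traceparent")),
	}, nil
}

// traceIDFromTraceparent returns the trace-id segment of a version-00 W3C
// traceparent value ("version-traceid-parentid-flags"), or "" if malformed.
func traceIDFromTraceparent(tp string) string {
	parts := strings.Split(tp, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return ""
	}
	return parts[1]
}

func main() {
	h := http.Header{}
	h.Set("x-tenant-id", "socayo")
	h.Set("x-user-id", "firebase_uid_123")
	h.Set("traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")

	rc, err := FromHeaders(h, "user")
	fmt.Printf("%+v %v\n", rc, err)
}
```

Rejecting the request in the factory is a design choice of this sketch; under Phase 0 the wrapper-level protovalidate rules would catch the same empty fields at every hop regardless.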
Flow 2: Conversation Acting on Behalf of User
The override-elimination flow. The conversation is the principal for the workflow's actions; the original user is preserved in on_behalf_of.
ConversationActor (handles incoming message from user:alice)
Receives RequestContext { subject: "user:alice", tenant: "socayo", … }
▼
ConversationActor → GenerationWorkflow.Run
Constructs new RequestContext for the workflow:
subject = "agent:conv-abc" ← conversation IS the principal
tenant = "socayo" ← preserved
on_behalf_of = "user:alice" ← user preserved in delegation chain
(no caller_key override — that field is dead)
Routes to workflow via SDK: wfpb.NewGenerationWorkflowServiceClient(ctx, runId)
▼
GenerationWorkflow → StorageGateway (e.g., fetch frame)
Passes the SAME RequestContext through. No re-override.
Storage authorizes via OpenFGA:
Check(agent:conv-abc, can_view, file:xyz)
(OpenFGA enforcement is the Phase 4 blocker — until storage can
authorize "agent:*" subjects, the override hack stays.)
▼
GenerationWorkflow → Kafka publish
EventContext built from RequestContext via BuildEventContext():
subject = "agent:conv-abc"
on_behalf_of = "user:alice"
tenant_id = "socayo"
correlation_id = "conv-abc" ← ALWAYS conversation_id, explicit
Kafka partition key:
tenancy.ResourceID(subject) → "conv-abc"
(Derived from subject — no longer reads caller_key.)
Where audit comes from. With both subject and on_behalf_of on the event envelope, "show every action conv-abc took for user:alice in the last 24h" is a single SQL filter on consumed events. Today, that question is unanswerable — only caller_key = "conv-abc" survives, and the user identity is gone.
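The delegation step and the partition-key derivation in Flow 2 can be sketched as two small helpers. Both names are hypothetical: delegateToAgent is invented for illustration, and resourceID stands in for tenancy.ResourceID(), whose prefix-stripping behavior is still open under OQ #4. RequestContext is again a plain-struct stand-in for the proto type.

```go
package main

import (
	"fmt"
	"strings"
)

// RequestContext is a plain-struct stand-in for the generated proto type.
type RequestContext struct {
	Subject    string
	Tenant     string
	OnBehalfOf string
	SessionID  string
}

// delegateToAgent returns a copy of rc in which the conversation becomes the
// principal and the original subject moves to OnBehalfOf. Single-level
// delegation only: any previous OnBehalfOf value is overwritten.
func delegateToAgent(rc RequestContext, conversationID string) RequestContext {
	out := rc // value copy; the caller's context is left untouched
	out.OnBehalfOf = rc.Subject
	out.Subject = "agent:" + conversationID
	return out
}

// resourceID strips the type prefix from a typed subject. This is the
// assumed "simple prefix strip" from OQ #4, not confirmed behavior.
func resourceID(subject string) string {
	if _, id, ok := strings.Cut(subject, ":"); ok {
		return id
	}
	return subject
}

func main() {
	incoming := RequestContext{Subject: "user:alice", Tenant: "socayo"}
	forWorkflow := delegateToAgent(incoming, "conv-abc")

	fmt.Printf("%+v\n", forWorkflow)
	fmt.Println("partition key:", resourceID(forWorkflow.Subject))
}
```

Because the tenant and session fields are copied unchanged, the workflow and storage hops see the same scope as the original request, with only the principal swapped.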
Flow 3: Async Identity in Events
BuildEventContext() after the refactor:
func BuildEventContext(ctx sdkgo.Context, eventName, correlationID string,
requestCtx *commonv1.RequestContext) *commonv1.EventContext {
ec := &commonv1.EventContext{
EventName: eventName,
CorrelationId: correlationID, // MUST be conversation_id, always explicit
EventId: uuid.NewString(),
EmittedAt: time.Now().UTC().Format(time.RFC3339),
}
if requestCtx != nil {
ec.TenantId = requestCtx.GetTenant()
ec.Subject = requestCtx.GetSubject() // identity flows
ec.OnBehalfOf = requestCtx.GetOnBehalfOf() // delegation flows
ec.SessionId = requestCtx.GetSessionId() // session flows
}
return ec
}
Three things change for downstream consumers:
- FirebaseBridge extracts the conversation ID from EventContext.subject ("agent:conv-abc" → "conv-abc") instead of from the overloaded caller_key. Same data, unambiguous source.
- Webhook routing (Convoy) can authorize event delivery against the subject — today it sees only the tenant.
- Audit pipelines can trace an event back to a typed principal without joining against any other store.
This is the CloudEvents subject pattern: "If identity attributes happen to be part of the event data, the event producer SHOULD also add them to context attributes" — so routing layers can inspect them without deserializing the payload.
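On the consumer side, having both identity fields on the envelope reduces an audit question to a plain filter. The sketch below runs that filter in memory; EventContext is a plain-struct stand-in for the proto type, actionsFor is a hypothetical helper, and the event names are invented for illustration.

```go
package main

import "fmt"

// EventContext is a plain-struct stand-in carrying only the fields
// this example needs.
type EventContext struct {
	EventName  string
	Subject    string
	OnBehalfOf string
	TenantID   string
}

// actionsFor returns the events a given principal emitted on behalf of a
// given user within one tenant. No join against another store is needed
// because both identities travel on the envelope.
func actionsFor(events []EventContext, tenant, subject, onBehalfOf string) []EventContext {
	var out []EventContext
	for _, e := range events {
		if e.TenantID == tenant && e.Subject == subject && e.OnBehalfOf == onBehalfOf {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	// Event names below are hypothetical.
	events := []EventContext{
		{EventName: "frame.generated", Subject: "agent:conv-abc", OnBehalfOf: "user:alice", TenantID: "socayo"},
		{EventName: "frame.generated", Subject: "agent:conv-def", OnBehalfOf: "user:bob", TenantID: "socayo"},
		{EventName: "file.read", Subject: "agent:conv-abc", OnBehalfOf: "user:alice", TenantID: "socayo"},
	}
	for _, e := range actionsFor(events, "socayo", "agent:conv-abc", "user:alice") {
		fmt.Println(e.EventName)
	}
}
```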
Protovalidate Enforcement
Today, zero buf.validate annotations exist on RequestContext or EventContext. The runtime is fully wired — every *_restate_wrappers.go calls w.Validator.Validate(req) before the handler — but there are no rules to enforce. A request with an empty tenant or no ctx at all passes silently.
Enforcement is two layers:
Layer 1 — Field rules on context messages (apis/common/v1/context.proto):
import "buf/validate/validate.proto";
message RequestContext {
string subject = 1 [(buf.validate.field).string.min_len = 1];
string tenant = 2 [(buf.validate.field).string.min_len = 1];
// ...
}
Layer 2 — Required ctx on every request message:
message SendMessageRequest {
.common.v1.RequestContext ctx = 1 [(buf.validate.field).required = true];
// domain fields start at 2
}
Both layers in place, the wrapper rejects:
| Bad input | Wrapper response |
|---|---|
| Missing ctx entirely | 400 — "ctx is required" |
| Empty tenant | 400 — "ctx.tenant must be at least 1 character" |
| Empty subject | 400 — "ctx.subject must be at least 1 character" |
| EventContext with empty tenant_id | 400 — "tenant_id must be at least 1 character" |
No Go code changes — the existing Validator.Validate(req) call handles it.
Risk. Phase 0 may surface latent bugs where today's services send empty context. Run the full integration suite before merging. OQ #6 is open on whether to roll Phase 0 out gradually (one domain at a time) or all at once; the PRD recommends all at once with integration test coverage.
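For readers who want to see the rejection table as logic, the following plain-Go stand-in mirrors the two layers of rules. It deliberately does not use protovalidate: the structs and the validate function are hypothetical, the messages are copied from the table above, and the real validator may format or aggregate violations differently.

```go
package main

import (
	"errors"
	"fmt"
)

// RequestContext and SendMessageRequest are plain-struct stand-ins for
// the generated proto types.
type RequestContext struct {
	Subject string
	Tenant  string
}

type SendMessageRequest struct {
	Ctx *RequestContext
}

// validate mirrors the rules by hand: ctx is required (Layer 2), and
// tenant and subject each need at least one character (Layer 1).
// It returns the first failure only.
func validate(req *SendMessageRequest) error {
	switch {
	case req.Ctx == nil:
		return errors.New("ctx is required")
	case len(req.Ctx.Tenant) < 1:
		return errors.New("ctx.tenant must be at least 1 character")
	case len(req.Ctx.Subject) < 1:
		return errors.New("ctx.subject must be at least 1 character")
	}
	return nil
}

func main() {
	requests := []*SendMessageRequest{
		{}, // no ctx at all
		{Ctx: &RequestContext{Subject: "user:alice"}}, // empty tenant
		{Ctx: &RequestContext{Tenant: "socayo"}},      // empty subject
		{Ctx: &RequestContext{Subject: "user:alice", Tenant: "socayo"}},
	}
	for _, r := range requests {
		fmt.Println(validate(r))
	}
}
```

Running a check like this against captured integration traffic is one way to estimate how many latent empty-context calls Phase 0 would surface before the annotations land.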
Proto Audit & ctx = 1 Standardization
Two conventions exist today: gateways use ctx = 100, inner services use ctx = 1. The refactor standardizes on ctx = 1 everywhere (OQ #5 resolved). Context is the primary input every RPC validates first; it belongs at field 1, not in the system-fields gutter at 100.
| Domain | Today | After |
|---|---|---|
| services/gateway | ctx = 100 | ctx = 1 — renumber |
| services/storage-gateway | ctx = 100 | ctx = 1 — renumber |
| services/notification-gateway | ctx = 100 | ctx = 1 — renumber |
| services/webhook-gateway | ctx = 100 | ctx = 1 — renumber |
| services/integrations-gateway | ctx = 100 | ctx = 1 — renumber |
| services/apikey-gateway | ctx = 100 | ctx = 1 — renumber |
| notification (inner) | context = 1 | ctx = 1 — rename to match convention |
| notification/platform_service | context = 1 | ctx = 1 — rename |
| storage, webhook, schema, conversation actor | ctx = 1 ✓ | unchanged |
Single proto-only PR per service, or one combined PR — no runtime risk because there are no users yet. Renumbering happens in Phase 7, bundled with adding ctx = 1 to RPCs that lack it entirely.
Independent bug. services/gateway/v1/impl.go:387 derives reqCtx on line 377 but does not pass Ctx: reqCtx into ListPendingApprovals. Every other call in the same handler does. Fix in passing; it is a missing-argument bug, unrelated to the field-number work.
Rollout Phases
The 9-phase plan from the PRD. Each phase is mostly mechanical — proto edits, one-shot Go renames, regenerate. No new infrastructure to provision.
| Phase | Scope | Status |
|---|---|---|
| 0. Protovalidate enforcement | Add buf.validate annotations to RequestContext + EventContext; add [(buf.validate.field).required = true] on every *Request.ctx. Purely additive — wrappers already call Validate(req). | Not started |
| 1. Add new fields | RequestContext.on_behalf_of (6) and session_id (7); EventContext.subject (7), on_behalf_of (8), and session_id (9). No consumers yet — zero risk. The session_id generation strategy is OQ #8 (deferred to product); the proto field lands in Phase 1 either way and passes through whatever KrakenD injects. | Not started |
| 2. Gateways set typed subjects | All 6 gateways set subject = "user:{id}" or "tenant_member:{id}"; centralize copy-pasted deriveRequestContext() into pkg/tenancy.FromHeaders(). | Not started |
| 3. ConversationActor sets agent subject | Replace CallerKey = conversationId override with subject = "agent:{convId}" + on_behalf_of = original subject. | Not started |
| 4. Eliminate workflow CallerKey overrides | Remove the storage-gateway CallerKey hack (workflows/generation/v1/impl.go:1032). Blocker: OpenFGA enforcement (tenant isolation Layer 4) must accept agent:* subjects first. | Not started — blocked on Authorization PRD |
| 5. Update event publishing + BuildEventContext | subject, on_behalf_of, session_id flow into EventContext; drop callerKey param; derive Kafka partition key from subject via tenancy.ResourceID(). | Not started |
| 6. FirebaseBridge | Extract conversation ID from EventContext.subject instead of caller_key. | Not started |
| 7. Standardize ctx = 1 everywhere | Renumber gateway protos (ctx = 100 → ctx = 1); rename notification's context to ctx; add ctx = 1 to RPCs missing it (openfga, github, rstt admin/introspection/deployment, cerebrium, generation GetStateRequest). Proto-only PRs. | Not started |
| 8. Mark caller_key deprecated, remove | Mark deprecated in both protos; remove from pkg/tenancy/keys.go (BuildActorKey stays — used for SDK client construction); remove gateway references; remove from MCP generated schemas; remove callerKey parameter from BuildEventContext(). | Not started |
Dependency ordering
| Phase depends on | Reason |
|---|---|
| Phase 4 depends on Authorization PRD (OpenFGA) | Storage cannot drop the CallerKey-override hack until OpenFGA can authorize agent:* subjects via relationship checks |
| Phase 3 depends on Phase 1 | ConversationActor needs on_behalf_of to exist before it can populate it |
| Phase 5 depends on Phase 1 | EventContext needs the new identity fields before publishing wires them up |
| Phase 6 depends on Phase 5 | FirebaseBridge consumes what Phase 5 starts publishing |
| Phase 7 depends on Phase 0 | Renumber after enforcement is in place — fewer moving parts |
| Phase 8 depends on Phases 2–7 | Cannot remove caller_key until every consumer has migrated off it |
| Phases 0–2 can run in parallel | Phase 0 is additive; Phase 1 adds unused fields; Phase 2 is gateway-by-gateway |
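The dependency table can be turned into execution waves, where each wave contains phases whose in-plan dependencies are all satisfied by earlier waves. The sketch below is an illustration only: the deps map transcribes the table, the waves function is a hypothetical helper, and Phase 4's external dependency on the Authorization PRD is not modeled, so its placement in the first wave reflects in-plan ordering, not actual readiness.

```go
package main

import (
	"fmt"
	"sort"
)

// deps maps each phase to the phases it depends on, per the table above.
// Phase 4 is blocked on the Authorization PRD, which is outside this graph.
var deps = map[int][]int{
	0: nil,
	1: nil,
	2: nil,
	3: {1},
	4: nil,
	5: {1},
	6: {5},
	7: {0},
	8: {2, 3, 4, 5, 6, 7},
}

// waves groups phases by dependency depth. It assumes the graph is acyclic;
// a cycle would make the recursion loop forever.
func waves(deps map[int][]int) [][]int {
	level := map[int]int{}
	var depth func(p int) int
	depth = func(p int) int {
		if l, ok := level[p]; ok {
			return l
		}
		l := 0
		for _, d := range deps[p] {
			if dl := depth(d) + 1; dl > l {
				l = dl
			}
		}
		level[p] = l
		return l
	}

	var out [][]int
	for p := range deps {
		l := depth(p)
		for len(out) <= l {
			out = append(out, nil)
		}
		out[l] = append(out[l], p)
	}
	for _, w := range out {
		sort.Ints(w) // map iteration order is random; sort for stable output
	}
	return out
}

func main() {
	fmt.Println(waves(deps)) // prints [[0 1 2 4] [3 5 7] [6] [8]]
}
```

The result shows that the phase numbering is already a valid order and that Phases 3, 5, and 7 can start as soon as their single prerequisite lands, without waiting on one another.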
Open Questions
| # | Question | Status |
|---|---|---|
| 1 | Should on_behalf_of support multi-level delegation chains (agent → agent → user)? | Open — single-level sufficient for now; revisit if Travila staff start acting on a tenant's agent: subjects |
| 2 | Should EventContext.caller_key follow the same refactor? | Resolved — yes, subject + on_behalf_of (fields 7–8) added; caller_key deprecated alongside RequestContext |
| 3 | CallerKey = tenantId with no user. | Proposed (revisit at implementation) — every Unkey key gains meta.tenant_member_id; KrakenD resolves it into x-user-id; admin RPCs always carry subject = "tenant_member:{x-user-id}"; no service_account: type. Cross-PRD dependency on API Keys PRD. |
| 4 | Should tenancy.ResourceID() parse typed subjects ("agent:conv-abc" → "conv-abc")? | Likely yes — simple prefix strip; needed by Phase 5 Kafka-key derivation |
| 5 | Should every service use ctx = 100, or is ctx = 1 acceptable for non-gateway services? | Resolved — standardize on ctx = 1 everywhere; aligns with the Zero Trust thesis (context is the primary input, belongs at field 1) |
| 6 | Phase 0 may surface latent bugs in services that send empty context. Roll out gradually or all at once? | Open — recommend all at once with integration test coverage |
| 7 | Should EventContext.subject be validated as required (min_len = 1), or optional during migration? | Open — recommend optional initially, required after Phase 5 ships |
| 8 | How should session_id be generated? | Deferred to product — not a merge blocker. Proto contract supports any of three implementations (client-generated header, server-derived from auth_time, server-generated + Redis); engineering proceeds with session_id as an optional unvalidated field. Final strategy is a product call driven by Travila's billing model and SDK story. |
| 9 | Should the Publish() fallback for correlation_id (using restate.Key(ctx)) be removed entirely, or kept as a safety net? | Open — recommend removing to force explicit WithCorrelationID() and surface bugs early |
Out of Scope for v0.5
- Multi-level delegation chains (agent → agent → user). Single-level only for now.
- service_account:* typed subject. OQ #3 deliberately does not introduce one — every action is attributable to a real tenant_member:* via the API Keys PRD's meta.tenant_member_id requirement.
- Backfill plan for legacy Unkey keys without meta.tenant_member_id. Captured in OQ #3 sub-questions; depends on production key inventory at refactor time.
- RequestContext.project and EventContext.project_id. Owned by the Projects PRD — fields 8 and 12 respectively. Can ship in parallel from Phase 0 once this refactor's protovalidate posture is in place.
- OTel baggage egress stripping. Not a refactor blocker, but pkg/restatex should strip non-traceparent OTel headers at network egress per W3C Baggage's "visible to anyone inspecting network traffic" warning.
Cross-References
- RequestContext Refactor PRD — the v0.5 design in full, including 30+ industry citations
- Tenant & Project Lifecycle Roadmap — multi-tenant substrate that ships on top of this refactor
- Edge Idempotency Roadmap — sibling roadmap; the namespace rewrite consumes the typed-subject format
- Authorization PRD — typed-subject model and the Phase 4 OpenFGA blocker
- Identity PRD (Zitadel) — issuer of tenant_member:* subjects
- API Keys PRD — meta.tenant_member_id required for OQ #3
- Google Zanzibar paper — typed-subject foundation
- Netflix Edge Authentication & Token-Agnostic Identity Propagation — Passport precedent
- Google BeyondProd — EUC tickets, hard-reject-on-missing-context posture
- CloudEvents specification — the subject attribute pattern
- W3C Trace Context — traceparent propagation, the source of trace_id
- OWASP Multi-Tenant Security Cheat Sheet — the "establish tenant context early, derive from verified claims, reject otherwise" mandate
- NIST SP 800-207 — Zero Trust Architecture — every-service-validates-independently posture