GCP Eventarc → Kafka Pipeline Roadmap
End-to-end view of how GCP-emitted events become a first-class part of the platform's async substrate — one ingestion pipe (Eventarc → Pub/Sub → Kafka Connect → Kafka), one shared envelope, multi-tenant routing built into the headers, idempotency built into the consumer. This page consolidates the GCP Eventarc → Kafka Pipeline PRD (Draft, 2026-04-24) and its dependencies into a single reference. The design is complete; the 3-phase rollout has not started.
This page is derived from the GCP Eventarc → Kafka Pipeline PRD and its closely related dependencies:
Primary (docs/prd/):
- GCP Eventarc → Kafka Pipeline — the rewritten draft (supersedes the 2026-01-27 `gcp-events-pipeline.md`); single-topic ingest, multi-tenant envelope, Handler-Internal idempotency, two-surface DLQ
Dependencies:
- Remove Manager Actor Pattern — replaces `StorageManagerActor` with stateless `StorageService`; the day-1 consumer of this pipeline. NFR-5 (consumer-side tenant isolation) is reused here verbatim.
- Projects — establishes the `tenants/{tid}/projects/{pid}/...` path convention every event must carry as `tenant_id` + `project_id` headers
- EndUserService (Identity Capture) — explains why Firestore `user_settings` events are out of scope (no consumer; `EndUserService` owns user state directly)
- Identity Platform — Zitadel — explains why Firebase Auth events are out of scope (Zitadel emits its own lifecycle events)
Related:
- Edge Idempotency Key — the consumer side reuses the Handler-Internal pattern. This pipeline is the first active caller of that pattern (see Idempotency Roadmap)
- Scheduled Jobs — explains why Cloud Scheduler events are out of scope (scheduler dispatches are synchronous HTTP, not async Kafka)
- Notification Gateway — why `device_tokens` Firestore events are not plumbed (`NotificationService` registers tokens directly with Novu)
- Firestore ADR — projection-store context behind the source-of-truth split
The whole design refuses to provision per-source infrastructure. Every GCP source the platform ever cares about lands on the same Pub/Sub topic, the same Kafka Connect connector, the same Kafka topic — and consumers discriminate on the CloudEvents ce-type header in-process:
- Single shared topic. `gcp.v1.events.event` carries every GCP event. Topic name is auto-derived by `protoc-gen-kafka` from the FQN of the envelope message (`gcp.v1.events.Event`). 12 partitions, 7-day retention — sized for ≤ 50 events/sec at day 1 with ~6× headroom for the future-sources catalog.
- CloudEvents-native filtering. `ce-type` (`google.cloud.storage.object.v1.finalized`, `google.cloud.cloudbuild.build.v1.statusCompleted`, etc.) is preserved as a Kafka header by the Pub/Sub Source connector. Consumers subscribe to the topic and `switch` on the type. No ksqlDB, no Kafka Streams, no enrichment job.
- Multi-tenant routing in the envelope, not the body. A small Single Message Transform on the connector parses the `tenants/{tid}/projects/{pid}/...` prefix from `ce-subject` and emits `x-tenant-id` + `x-project-id` Kafka headers. Consumers MUST treat these headers as authoritative — never re-derive tenancy from the payload.
- Adding a source is a fixed recipe. New Eventarc trigger + new typed payload proto + new `case` in the consumer's switch. No new Pub/Sub topic, no new Kafka topic, no new connector instance, no new partitions. The recipe fits in a single PR with no infrastructure approval.
Canonical reference: GCP Eventarc → Kafka Pipeline PRD.
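A minimal sketch of that in-process discrimination, assuming a stand-in record type (the struct and handler names here are illustrative, not the platform's actual types):

```go
package main

import "fmt"

// Record stands in for a consumed Kafka record; only headers matter here.
type Record struct {
	Headers map[string]string
}

// dispatch routes a record by its CloudEvents type header. Unknown types
// are ignored — a consumer only handles the cases it cares about.
func dispatch(rec Record) string {
	switch rec.Headers["ce-type"] {
	case "google.cloud.storage.object.v1.finalized":
		return "handleObjectFinalized"
	case "google.cloud.storage.object.v1.deleted":
		return "handleObjectDeleted"
	default:
		return "ignored" // not a storage event; no error, no DLQ
	}
}

func main() {
	rec := Record{Headers: map[string]string{
		"ce-type": "google.cloud.storage.object.v1.finalized",
	}}
	fmt.Println(dispatch(rec)) // prints "handleObjectFinalized"
}
```

Adding a source means adding a `case`; the default branch guarantees unrelated event types pass through harmlessly.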
Glossary
| Term | Definition |
|---|---|
| Eventarc trigger | One per GCP source instance (one bucket, one Firestore database, one IAM scope). Subscribes to a CloudEvent stream and publishes to the shared gcp-events Pub/Sub topic. |
| Pub/Sub topic gcp-events | Single shared durable buffer for all GCP events. One topic, one subscription, one DLQ — provisioned once, never grows with the source count. |
| Kafka topic gcp.v1.events.event | Single shared Kafka topic carrying every GCP event. Name derived by protoc-gen-kafka from the envelope FQN. Partition key = ce-subject. |
| Envelope | The gcp.v1.events.Event proto — CloudEvents 1.0 attributes plus the platform's required tenant_id / project_id and a google.protobuf.Any payload. The only message in the package with the (kafka.options.v1.topic) annotation. |
| Typed payload | A per-source proto (e.g., GCSObjectFinalized, GCSObjectDeleted, future CloudBuildCompleted) carried inside Event.payload as Any. No oneof — Any keeps the envelope stable as new sources are added. |
| ce-type header | CloudEvents type attribute (google.cloud.storage.object.v1.finalized). The canonical discriminator — consumers switch on it. |
| x-tenant-id / x-project-id headers | Platform-required Kafka headers populated by the connector's SMT from the tenants/{tid}/projects/{pid}/... prefix in ce-subject. Authoritative for routing — handlers MUST NOT re-derive from payload. |
| _platform sentinel | Reserved value of tenant_id / project_id for organization-wide events (IAM audit logs, billing notifications, GKE control-plane). Underscore prefix cannot collide with real tenant/project IDs. |
| Two-surface DLQ | Independent failure modes with separate dead-letter topics: Surface A = Pub/Sub-side delivery failure (gcp-events-dlq Pub/Sub topic), Surface B = connector/consumer-side processing failure (gcp.v1.events.event-dlq Kafka topic). Not automatically mirrored — both must be monitored. |
| Handler-Internal dedup | The Edge Idempotency PRD pattern where the consumer derives a dedup token from a trusted server-side signal. Here that signal is ce.id — sender-generated by GCP, unique per delivery, stable across retries. Passed to restate.WithIdempotencyKey("gcp-event:" + event.Id) on every effecting downstream call. |
| last_event_id | Belt-and-braces dedup field on the file Firestore document. The upsert is a transactional read-then-write that no-ops if last_event_id == event.id. Catches Pub/Sub redeliveries that arrive after Restate's idempotency cache has expired. |
What This Replaces
The pipeline rewrite is driven by three platform changes since the original 2026-01-27 PRD landed.
| Change | Effect on the original design |
|---|---|
| Manager actors are being removed (Remove Manager Actor PRD, 2026-04-22). StorageManagerActor → stateless StorageService backed by Firestore. | The original storage-event consumer, written against the actor's EXCLUSIVE handlers and Restate K/V state, no longer applies. New consumer is a stateless service handler — concurrent uploads to the same user run in parallel. |
| Projects exist (Projects PRD, April 2026). Tenant data is project-scoped: tenants/{tid}/projects/{pid}/.... | Every event must carry both tenant_id and project_id. Path-prefix convention is canonical for path-based subjects (Cloud Storage object names, Firestore document paths, BigQuery table IDs). |
| End-user identity is owned by EndUserService (EndUserService PRD, April 2026). SocayoUserActor is gone. | Firestore user_settings events flowing through Kafka would land nowhere. Removed from scope; EndUserService owns user state, Firestore is the source of truth, no event round-trip needed. |
Three categories of events appeared in the 2026-04-10 draft and are out of scope here:
| Removed | Where it lives now |
|---|---|
| Firebase Auth events | Identity Platform — Zitadel — Zitadel emits its own lifecycle events; tenant member auth no longer routes through Firebase. |
| Cloud Scheduler events | Scheduled Jobs PRD — scheduler dispatches are synchronous HTTP through KrakenD, not async Kafka. |
| Firestore document events | No active consumer. device_tokens is owned by NotificationService (registered with Novu directly). user_settings is owned by EndUserService (Firestore is source of truth, written from the gateway upsert). shared_plans has no event-driven consumer planned. Eventarc supports Firestore natively if a future need appears — one-line addition, no need to plumb in advance. |
The Pipeline
The narrow waist is the single shared topic. Adding a new GCP source is a fixed recipe: declare an Eventarc trigger, add a typed payload proto, add a case to the consumer's switch. No new Pub/Sub topic, no new Kafka topic, no new connector instance, no new partitions.
Why Pub/Sub Between Eventarc and Kafka Connect
Eventarc supports Pub/Sub destinations natively, and Pub/Sub gives the pipeline three properties for free:
- Backpressure handling. Ack-deadline (60s), retry policy, DLQ — provided by Pub/Sub without writing code.
- Connector outage tolerance. Kafka Connect can be redeployed, upgraded, or left down through a broker outage without losing events — they accumulate in the subscription.
- Replay. If the Kafka topic is rebuilt or repartitioned, Kafka Connect resumes from a fresh subscription cursor without involving the GCP source.
Why a Single Kafka Topic
| Property | Single shared topic | Per-type topics |
|---|---|---|
| Provisioning cost | 1 topic, 1 subscription, 1 connector — provisioned once | N topics, N connectors, growing with source count |
| Per-source PR scope | 1 trigger + 1 proto + 1 case | + topic config + connector config + Terraform module |
| Filtering | In-process switch on ce-type header | Native, per-topic |
| Stream processing required | None | None |
| Splittable later? | Yes — promote any one source to its own topic without touching others; envelope unchanged | N/A |
The single-topic decision survives the rewrite. CloudEvents ce-type filtering is the canonical discriminator pattern for the spec; the cost of the in-process switch is negligible at platform-event volumes (storage finalize/delete dominates day 1; future agent sources are even lower volume).
Multi-Tenant Addressing
Every event carries both tenant_id and project_id. Hard requirement — the consumer side has no fallback path. An event without these fields is routed to the DLQ.
Path-Prefix Convention
For sources whose subject is a path (Cloud Storage object names, Firestore document paths, BigQuery table IDs), the prefix is canonical:
tenants/{tenant_id}/projects/{project_id}/{source-specific-suffix}
| Source | Subject |
|---|---|
| Cloud Storage | tenants/acme/projects/healthcoach-prod/users/u_42/profile.jpg |
| Cloud Storage | tenants/acme/projects/fitness-prod/conversations/c_99/audio.webm |
| Firestore (future) | tenants/acme/projects/healthcoach-prod/end_users/u_42/sessions/s_1 |
pkg/tenancy Path Helpers
Two helpers are added to pkg/tenancy alongside the existing actor-key helpers (BuildActorKey / ParseActorKey):
```go
// PathFor returns "tenants/{tenant_id}/projects/{project_id}/{suffix}".
// Validates that tenant_id and project_id are non-empty and not the
// reserved "_platform" sentinel.
func PathFor(tenantID, projectID, suffix string) string

// ParsePath parses a path produced by PathFor (or written by uploaders
// that follow the convention) and returns its components, or an error
// if the prefix is malformed.
func ParsePath(path string) (tenantID, projectID, suffix string, err error)
```
Centralizing prefix construction keeps consumers from drifting and makes a future refactor (e.g., adding a region segment) a one-diff change.
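A sketch of how these helpers might be implemented. The PRD's `PathFor` signature returns only a string and leaves validation-failure behavior unspecified; this version returns an error instead — an assumption for illustration, not the committed API:

```go
package main

import (
	"fmt"
	"strings"
)

// PathFor builds "tenants/{tenant_id}/projects/{project_id}/{suffix}",
// rejecting empty IDs and the reserved "_platform" sentinel.
func PathFor(tenantID, projectID, suffix string) (string, error) {
	for _, id := range []string{tenantID, projectID} {
		if id == "" || id == "_platform" {
			return "", fmt.Errorf("invalid tenant/project id: %q", id)
		}
	}
	return fmt.Sprintf("tenants/%s/projects/%s/%s", tenantID, projectID, suffix), nil
}

// ParsePath is the inverse: it extracts the tenant ID, project ID, and
// source-specific suffix, failing on a malformed prefix.
func ParsePath(path string) (tenantID, projectID, suffix string, err error) {
	parts := strings.SplitN(path, "/", 5)
	if len(parts) != 5 || parts[0] != "tenants" || parts[2] != "projects" ||
		parts[1] == "" || parts[3] == "" {
		return "", "", "", fmt.Errorf("malformed tenancy prefix: %q", path)
	}
	return parts[1], parts[3], parts[4], nil
}

func main() {
	p, _ := PathFor("acme", "hcprod", "users/u_42/profile.jpg")
	fmt.Println(p) // tenants/acme/projects/hcprod/users/u_42/profile.jpg
}
```

Because `SplitN` stops after the fourth slash, the suffix keeps its own internal slashes — exactly what the Cloud Storage object paths in the table above need.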
The _platform Sentinel
Some GCP events are organization-wide and don't belong to a tenant — IAM audit logs, billing notifications, project-level GKE control-plane events. These flow through the pipeline with tenant_id = "_platform", project_id = "_platform". Reserved value, rejected as a real ID elsewhere in the platform. Tenant-scoped consumers ignore them naturally.
Today, _platform is recognized only inside the Edge Idempotency Key PRD's key-namespacing layer. Treating it as a valid tenant_id / project_id value end-to-end requires updates to pkg/tenancy validators, Firestore security rules, and the OpenFGA model — none of which currently whitelist a _-prefixed ID. Out of scope for this PRD; will land with the first _platform-scoped consumer (DevOps or Security agent) under its own PRD. No day-1 source emits _platform-scoped events, so this caveat does not block the day-1 rollout.
Header Propagation
Eventarc → Pub/Sub uses CloudEvents natively. The Pub/Sub Source connector preserves CloudEvent attributes as Kafka record headers (ce-id, ce-type, ce-source, ce-subject, ce-time). On top of those, a small Single Message Transform — living in integrations/kafka-connect/transforms/, unit-tested independently — extracts tenant_id / project_id from the subject's path prefix and emits them as x-tenant-id / x-project-id headers.
For sources whose subject is not a path (e.g., Eventarc resources keyed by project number or bucket name), the trigger's filter restricts it to one (tenant, project) at trigger creation time, and the headers are populated from trigger metadata via a small Cloud Function adapter sitting between Eventarc and Pub/Sub. This is the rare case, not the default.
Tenant Isolation at the Consumer
Consumers MUST treat the x-tenant-id and x-project-id headers as authoritative. They MUST NOT re-derive tenancy from event payload fields. This matches Remove Manager Actor Pattern PRD's NFR-5 on consumer-side tenant isolation: the source of truth for routing is the envelope, not the body.
Event Schema
Envelope
```protobuf
syntax = "proto3";

package gcp.v1.events;

import "google/protobuf/any.proto";
import "google/protobuf/timestamp.proto";
import "kafka/options/v1/options.proto";

option go_package = "github.com/bo-socayo/workflows-v7/apis/gcp/v1/events;gcpeventsv1";

// Event is the Kafka envelope for every event ingested through the
// GCP Eventarc pipeline. CloudEvents 1.0 plus tenant_id and project_id.
message Event {
  option (kafka.options.v1.topic) = true;

  // CloudEvents 1.0 attributes
  string spec_version = 1; // "1.0"
  string id = 2;           // Sender-generated unique ID; used for handler-internal dedup
  string type = 3;         // "google.cloud.storage.object.v1.finalized"
  string source = 4;
  string subject = 5;
  google.protobuf.Timestamp time = 6;

  // Platform-required routing fields, populated by the connector
  string tenant_id = 10;  // From subject path or trigger metadata; "_platform" for org-wide events
  string project_id = 11; // Same provenance as tenant_id

  // Source-specific typed payload — see Payloads
  google.protobuf.Any payload = 20;
}
```
Why `Any` and not `oneof`. Platform convention avoids `oneof` in new protos (same precedent as `ContentPart`). `Any` keeps the envelope stable as new sources are added — adding a new payload type does not change `Event`. The trade-off (consumers must `payload.UnmarshalTo(&typedMsg)`) is acceptable because consumers already discriminate on `type` first.
Day-1 Payloads (Cloud Storage)
```protobuf
// apis/gcp/v1/events/storage.proto

message GCSObjectFinalized {
  string bucket = 1;
  string object_path = 2; // Full path inside the bucket
  string content_type = 3;
  int64 size_bytes = 4;
  string md5_hash = 5;
  google.protobuf.Timestamp created_at = 6;
  map<string, string> object_metadata = 7; // GCS object metadata (custom headers)
}

message GCSObjectDeleted {
  string bucket = 1;
  string object_path = 2;
  google.protobuf.Timestamp deleted_at = 3;
}
```
No Per-Source Kafka Topics
A previous draft suggested per-typed-payload topic annotations (storage.v1.object-created, etc.). Removed. All payloads ride on the single gcp.v1.events.event topic; typed protos exist only as Any-unwrapped structures consumers decode after dispatch.
Component Inventory
New / Extended Components
| Component | Purpose | PRD section |
|---|---|---|
| Pub/Sub topic gcp-events | Single shared durable buffer for all GCP events; one subscription gcp-events-kafka-connect consumed by the connector | §3.1 |
| Pub/Sub topic gcp-events-dlq | Surface-A DLQ — Kafka Connect can't pull or ack | §8.4 |
| Kafka Connect Pub/Sub Source connector | Pulls from Pub/Sub, writes to gcp.v1.events.event. Preserves CloudEvent attributes as Kafka headers; SMT extracts tenant/project | §3.1, §11.2 |
| Kafka topic gcp.v1.events.event | Single shared topic, 12 partitions, 7-day retention. Name auto-derived by protoc-gen-kafka from the envelope FQN | §11.1 |
| Kafka topic gcp.v1.events.event-dlq | Surface-B DLQ — connector or consumer can't process | §8.4, §11.1 |
| apis/gcp/v1/events/event.proto | The shared Event envelope; (kafka.options.v1.topic) annotation is here, on the envelope only | §5.1 |
| apis/gcp/v1/events/storage.proto | Day-1 typed payloads (GCSObjectFinalized, GCSObjectDeleted) | §5.2 |
| pkg/tenancy.PathFor / ParsePath | Centralized path-prefix construction and parsing for tenants/{tid}/projects/{pid}/... | §4.1 |
| integrations/kafka-connect/transforms/ | Single Message Transform that parses tenants/{}/projects/{} from ce-subject into x-tenant-id / x-project-id headers | §4.3, §11.2 |
| StorageService.ReceiveEvent (subscribed handler) | Day-1 consumer; filters for storage event types and dispatches to typed handlers | §7 |
| Eventarc trigger Terraform module (infrastructure/terraform/modules/gcp-events-trigger-gcs/) | Reusable per-bucket trigger module | §12 |
| Pipeline Terraform module (infrastructure/terraform/modules/gcp-events-pipeline/) | Pipe-level resources (Pub/Sub, subscription, DLQ, connector, IAM) — instantiated once per env | §12 |
Integrated Components (No Changes Required)
| Component | Role |
|---|---|
| Restate Go SDK | restate.WithIdempotencyKey(...) is the dedup primitive for Handler-Internal calls. No SDK changes. |
| pkg/tenancy (existing) | Already carries BuildActorKey / ParseActorKey; the path helpers fit alongside them. |
| protoc-gen-kafka | Auto-derives the Kafka topic name from the envelope's lowercased proto FQN. The envelope is the only message in gcp.v1.events with the topic annotation — no changes to the plugin. |
| pkg/telemetry | Existing audit/observability path. The original draft's bespoke AuditService consumer is removed — telemetry uses the standard integration. |
| Lago metering | storage_gb usage events flow through the existing Lago path. Envelope's tenant_id is the dimension key. |
Observability Additions
Metrics map to the two failure surfaces in §8.4. Each surface has its own alert.
| Metric | Source | Surface | Alert threshold |
|---|---|---|---|
| pubsub_subscription_num_undelivered_messages (subscription = gcp-events-kafka-connect) | Pub/Sub | A | > 1000 for 5 min |
| pubsub_subscription_oldest_unacked_message_age | Pub/Sub | A | > 300s |
| pubsub_dlq_message_count (custom, on gcp-events-dlq Pub/Sub topic) | Pub/Sub | A | > 0 for 15 min |
| kafka_connect_source_record_poll_total | Kafka Connect | — | observed only |
| kafka_consumer_group_lag (group = storage-service) | Kafka | B | > 1000 for 5 min |
| gcp_events_dlq_message_count (custom, on gcp.v1.events.event-dlq Kafka topic) | Kafka exporter | B | > 0 for 15 min |
Dashboards: GCP Events Pipeline (Pub/Sub backlog, connector record rate, consumer-group lag), GCP Events by Type (record count + DLQ rate per ce-type), GCP Events by Tenant/Project (record count by x-tenant-id × x-project-id — useful when a tenant onboards or a project's traffic spikes).
Day-1 Source: Cloud Storage
Multi-Bucket From Day One
The pipeline is bucket-agnostic. Day 1 uses one bucket (travila-uploads), but the architecture supports any number without code changes:
- The Eventarc trigger is a parameterized Terraform module — adding bucket #2 is a one-call instantiation.
- `StorageService` reads the bucket name from `GCSObjectFinalized.bucket`, never from a constant.
- Path conventions inside a bucket follow §6.3 by default; a bucket can opt out (e.g., a future `travila-public-assets` bucket holding tenant-agnostic platform assets).
Bucket Registry
| Bucket | Purpose | Path convention | Day 1? |
|---|---|---|---|
| travila-uploads | All end-user file uploads (profile photos, audio, documents, exports) | tenants/{tid}/projects/{pid}/users/{uid}/{file_id}.{ext} | Yes |
| (future) travila-generated-content | AI-generated images and reports | Same as above | No |
| (future) travila-public-assets | Platform-wide static assets (no tenant scope) | assets/{asset_id}.{ext} | No |
Adding a new bucket: Terraform module instantiation + an entry in this table + (if path convention differs) consumer-side handling.
Event Types
| Event | CloudEvent type | Trigger condition |
|---|---|---|
| Object finalized | google.cloud.storage.object.v1.finalized | Upload complete (initial PUT or compose finished) |
| Object deleted | google.cloud.storage.object.v1.deleted | Object removed via API or lifecycle rule |
object.metadataUpdated is excluded — the platform does not mutate object metadata after upload.
Sample Event (After Connector Processing)
Kafka record on `gcp.v1.events.event`:

```text
Headers:
  ce-id:        13284820103245192
  ce-type:      google.cloud.storage.object.v1.finalized
  ce-source:    //storage.googleapis.com/projects/_/buckets/travila-uploads
  ce-subject:   objects/tenants/acme/projects/hcprod/users/u_42/file_8a91.jpg
  ce-time:      2026-04-24T12:00:00.000Z
  x-tenant-id:  acme
  x-project-id: hcprod

Body:
  Event {
    id: "13284820103245192"
    type: "google.cloud.storage.object.v1.finalized"
    tenant_id: "acme"
    project_id: "hcprod"
    payload: Any { GCSObjectFinalized { bucket, object_path, … } }
  }
```
Day-1 Consumer: StorageService
StorageService (the stateless Restate service from Remove Manager Actor Pattern PRD §FR-1.2) gets one new Kafka-subscribed handler.
```protobuf
service StorageService {
  // Existing CRUD handlers unchanged.

  // Reconciles a GCS event against Firestore + Lago metering.
  // Subscribed to gcp.v1.events.event; filters for storage event types.
  rpc ReceiveEvent(gcp.v1.events.Event) returns (google.protobuf.Empty) {
    option (kafka.options.v1.subscribe) = true;
  }
}
```
No EXCLUSIVE handler. StorageService is stateless; concurrent uploads to the same user run in parallel, bounded only by GCS itself.
```text
ReceiveEvent(event)
  1. Validate envelope: event.tenant_id and event.project_id must be set
     (else TerminalError → DLQ via Restate's default policy)
  2. Switch on event.type:
       google.cloud.storage.object.v1.finalized → handleObjectFinalized
       google.cloud.storage.object.v1.deleted   → handleObjectDeleted
       (default — not a storage event)          → return nil (ignore)

handleObjectFinalized(event, payload)
  1. ParsePath(payload.object_path) → (tenant, project, suffix)
     Reject (DLQ) on parse failure.
  2. Cross-check parsed tenant/project against envelope tenant_id/project_id.
     Discrepancy → TerminalError → DLQ.
  3. Upsert Firestore file doc at tenants/{tid}/projects/{pid}/users/{uid}/files/{file_id}
     with size_bytes, content_type, md5_hash, storage_ref, last_event_id = event.id.
     Pass restate.WithIdempotencyKey("gcp-event:" + event.Id) on the call.
  4. Emit storage_gb usage event to Kafka via the existing Lago metering path.
     Same idempotency key.

handleObjectDeleted(event, payload)
  1. ParsePath as above.
  2. Delete Firestore file doc (idempotent — already-absent is fine).
  3. Emit storage_gb decrement usage event.
```
What the Old Manager Actor Did That This Doesn't
| Old behavior | Now |
|---|---|
| Wrote into the actor's K/V file tree | Writes a Firestore document; tree is queried via Firestore subcollection reads |
| EXCLUSIVE lock per user serialized concurrent uploads | No lock; concurrent uploads run in parallel |
| Quota enforcement via in-actor counter | Quota enforcement via Lago + Redis (Remove Manager Actor Pattern §FR-1.3) |
| Thumbnail generation inline | Out of scope here; if added later, a separate handler subscribes to the same event and is observed independently |
Idempotency: Handler-Internal via ce.id
Pub/Sub is at-least-once. The Pub/Sub Source connector inherits that. Consumers will see the same event more than once during retries, broker rebalances, or connector restarts. The platform's Edge Idempotency Key PRD addresses this for HTTP edges; Kafka consumers need the equivalent.
The Pattern
Per the Edge Idempotency PRD's Handler-Internal pattern, the consumer derives a dedup token from a trusted server-side signal. CloudEvents ce.id is exactly that — sender-generated by GCP, unique per delivery, stable across retries. The consumer passes it to restate.WithIdempotencyKey(...) on every downstream invocation that has effects we care about deduping:
```go
// Inside StorageService.handleObjectFinalized:
fileDoc := buildFileDoc(event, payload)

_, err := s.firestoreClient.UpsertFile(
	ctx,
	fileDoc,
	restate.WithIdempotencyKey("gcp-event:"+event.Id),
)
if err != nil {
	return err
}

return s.lagoMetering.RecordStorageUsage(
	ctx,
	event.TenantId,
	payload.SizeBytes,
	restate.WithIdempotencyKey("gcp-event:"+event.Id),
)
```
The gcp-event: prefix namespaces the key against unrelated callers (per Edge Idempotency PRD §Cache Isolation). Restate's native dedup mechanism handles the rest.
The Edge Idempotency Key PRD v1.1 currently states that Handler-Internal has no active callers — Asynq scheduler became Client-Generated, leaving the pattern dormant as a template. The StorageService event consumer specified here is the first active caller. When this PRD ships, the Edge Idempotency PRD's "no active callers" caveat needs to be updated to point at this consumer.
Belt and Braces: last_event_id on the File Document
In addition to per-call WithIdempotencyKey, the file Firestore document carries last_event_id. The upsert is a transactional read-then-write that no-ops if last_event_id == event.id already. This catches the case where Restate's idempotency cache has expired (default 24h) but a Pub/Sub redelivery still arrives.
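Stripped of Firestore specifics, the transactional check reduces to comparing the stored `last_event_id` against the incoming event ID before writing. A sketch against an in-memory stand-in (struct and function names are illustrative):

```go
package main

import "fmt"

// FileDoc stands in for the Firestore file document.
type FileDoc struct {
	SizeBytes   int64
	LastEventID string
}

// upsert applies the event only if this event ID hasn't already been
// recorded; in the real pipeline this read-then-write runs inside a
// Firestore transaction.
func upsert(store map[string]*FileDoc, docID, eventID string, size int64) bool {
	if doc, ok := store[docID]; ok && doc.LastEventID == eventID {
		return false // redelivery after the idempotency cache expired: no-op
	}
	store[docID] = &FileDoc{SizeBytes: size, LastEventID: eventID}
	return true
}

func main() {
	store := map[string]*FileDoc{}
	fmt.Println(upsert(store, "file_8a91", "13284820103245192", 2048)) // true: first delivery
	fmt.Println(upsert(store, "file_8a91", "13284820103245192", 2048)) // false: redelivery
}
```

A genuinely new event for the same document (different `ce-id`) still writes through, so the guard dedupes redeliveries without blocking legitimate updates.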
Out-of-Order Events
Pub/Sub does not guarantee order, but the Storage event volume per object is low (one finalized + at most one deleted). The consumer treats both operations as idempotent and order-tolerant: deleted-then-finalized leaves the document present; finalized-then-deleted leaves it absent. If a future source needs ordering (e.g., Cloud Build with multiple status updates), it adopts a per-resource sequence number in its typed payload and the consumer compares against last_seq on the target document.
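The two arrival orders can be modeled with a toy in-memory store; the outcomes match the behavior described above:

```go
package main

import "fmt"

// store maps object path → present. Finalized upserts; deleted removes
// (already-absent is fine — delete is idempotent).
func applyFinalized(store map[string]bool, path string) { store[path] = true }
func applyDeleted(store map[string]bool, path string)   { delete(store, path) }

func main() {
	// deleted-then-finalized: document ends up present
	a := map[string]bool{}
	applyDeleted(a, "p")
	applyFinalized(a, "p")
	fmt.Println(a["p"]) // true

	// finalized-then-deleted: document ends up absent
	b := map[string]bool{}
	applyFinalized(b, "p")
	applyDeleted(b, "p")
	fmt.Println(b["p"]) // false
}
```

Neither ordering errors or leaves a stuck state, which is the property the consumer actually relies on.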
Two-Surface DLQ
The pipeline has two independent failure surfaces, each with its own DLQ. They are not the same flow and are not automatically mirrored — oncall must monitor both.
| Surface | Where the failure happens | DLQ topic | Detection metric | Typical causes |
|---|---|---|---|---|
| A — Pub/Sub-side delivery | Kafka Connect is unreachable, stuck, or its credentials are revoked. Pub/Sub retries up to 5 times, then routes to a Pub/Sub DLQ. | gcp-events-dlq (Pub/Sub, not Kafka) | pubsub_subscription_num_undelivered_messages, pubsub_subscription_oldest_unacked_message_age, pubsub_dlq_message_count | Connector outage, IAM credential rotation, Kafka broker partition |
| B — Connector / consumer-side processing | Kafka Connect successfully pulled the message but failed to publish to Kafka — bad payload encoding, SMT failure parsing the prefix, or transient broker write error. | gcp.v1.events.event-dlq (Kafka) | gcp_events_dlq_message_count, kafka_consumer_group_lag | Malformed subject path, missing tenant/project headers, payload type mismatch with ce-type |
No automatic mirroring. A failure on Surface A is not visible on Surface B and vice versa. If a future operational need calls for one unified DLQ, the right move is a separate, narrowly-scoped Kafka Connect connector that reads gcp-events-dlq (Pub/Sub) and writes to gcp.v1.events.event-dlq (Kafka). Out of scope for day 1; the two-surface model is sufficient.
Adding a New GCP Source — The Recipe
The whole point of the design. The recipe, in order:
- Pick the CloudEvent type(s) the source emits. Look it up in the Eventarc event reference.
- Define a typed payload proto at `apis/gcp/v1/events/{source_name}.proto`. Mirror only the fields the consumer needs; don't replicate the entire CloudEvent body.
- Instantiate the Eventarc trigger module in Terraform with the source-specific filter (bucket, document path pattern, project, etc.). Triggers always target `google_pubsub_topic.gcp_events`.
- Add a `case` to the target consumer's `ReceiveEvent` switch matching the new `ce.type`. Decode `event.Payload` into the typed message; do the work.
- Document the source in the Future Sources Catalog, promoting it from "future" to "active" once the consumer ships.
There is no infrastructure step. No new Pub/Sub topic. No new Kafka topic. No new connector instance. No new partitions.
Future Sources Catalog (DevOps Agents)
These are the agent-relevant sources we expect to add. None are in this PRD's implementation scope; they're listed so reviewers can sanity-check that the pipeline shape supports them. For each, the recipe above applies unchanged.
| Source | Eventarc type(s) | Likely consumer | Tenant scope | Notes |
|---|---|---|---|---|
| Cloud Build | google.cloud.cloudbuild.build.v1.{statusChanged,statusFailed,statusCompleted} | DevOpsAgentService (future) | _platform | Build outcomes per repo / per branch. Trigger ID identifies the workflow. |
| Cloud Run | google.cloud.run.revision.v1.{ready,failed} | DevOpsAgentService | _platform | Revision rollout health — useful for "did the deploy succeed?" agent queries. |
| GKE control plane | google.cloud.audit.log.v1.written (filtered to container.googleapis.com) | DevOpsAgentService | _platform | Cluster-level changes (node pool resize, version upgrades). |
| IAM audit log | google.cloud.audit.log.v1.written (filtered to IAM methods) | SecurityAgentService (future) | _platform | High-signal security trail — grants, key creates, policy changes. |
| Cloud Logging severity sink | google.cloud.audit.log.v1.written filtered to severity >= ERROR | OpsAlertConsumer (future) | _platform or per-tenant if log labels carry tenancy | Catch-all for unhandled errors that should reach an agent. |
Security & Compliance
| Requirement | Mechanism |
|---|---|
| Encryption in transit | TLS for Pub/Sub, mTLS between Kafka Connect ↔ Kafka |
| Encryption at rest | GCP default for Pub/Sub; Kafka volumes encrypted at the disk level |
| Retention | 7 days on gcp.v1.events.event; 14 days on gcp.v1.events.event-dlq. Aligns with the platform privacy retention window. |
| Audit trail | Cloud Audit Logs on the Pub/Sub topic; Restate / Kafka logging via pkg/telemetry. No bespoke AuditService consumer — removed in favor of the standard telemetry path. |
| Tenant isolation | x-tenant-id / x-project-id headers are authoritative; consumers MUST NOT re-derive tenancy from the body. Protects against payload-spoofing in the rare case where a misconfigured trigger would otherwise emit a path the consumer would interpret as a different tenant. |
IAM
| Service account | Roles | Scope |
|---|---|---|
| eventarc-gcp-events | roles/pubsub.publisher on gcp-events | Eventarc → Pub/Sub |
| kafka-connect-gcp | roles/pubsub.subscriber on gcp-events-kafka-connect | Kafka Connect → Pub/Sub |
Rollout Phases
The 3-phase plan from the PRD. Each phase is mostly mechanical — Terraform module instantiation, proto + handler edits, observability config.
| Phase | Scope | Status |
|---|---|---|
| 1. Pipeline infrastructure | Deploy gcp-events-pipeline Terraform module (Pub/Sub topic, subscription, DLQ, IAM, Kafka Connect connector). Verify a synthetic Pub/Sub publish appears in gcp.v1.events.event with headers preserved. Deploy monitoring dashboards and alerts. | Not started |
| 2. Cloud Storage source + StorageService consumer | Instantiate gcp-events-trigger-gcs for travila-uploads. Land StorageService.ReceiveEvent (subscribed handler + handleObjectFinalized / handleObjectDeleted). End-to-end test: upload → Firestore document + Lago metering event. Migrate any pre-existing storage workflows that previously called StorageManagerActor synchronously (covered by Remove Manager Actor Pattern PRD §FR-4, not duplicated here). | Not started |
| 3. First DevOps source (Cloud Build) | Add apis/gcp/v1/events/cloud_build.proto. Instantiate Eventarc trigger for Cloud Build status change events. Land DevOpsAgentService.ReceiveEvent (skeleton consumer that logs build outcomes; agent reasoning out of scope). Validate the recipe works end-to-end with no infra change. Promote Cloud Build from "future" to "active" in the Future Sources Catalog. | Not started |
Dependency Ordering
| Phase depends on | Reason |
|---|---|
| Phase 1 depends on the Projects PRD | Path convention tenants/{tid}/projects/{pid}/... is established by Projects; the SMT in §11.2 parses against that exact prefix |
| Phase 2 depends on Phase 1 | The StorageService consumer needs the Kafka topic and connector running before it can subscribe |
| Phase 2 depends on the Remove Manager Actor Pattern PRD | The day-1 consumer is the new stateless StorageService from that PRD; it doesn't exist as a destination until that PRD ships |
| Phase 3 depends on Phase 2 | Validates that the recipe is what's claimed — second source, no infra change |
| Phase 3 is independent of _platform end-to-end support | Cloud Build is _platform-scoped, but the day-1 consumer's logging skeleton doesn't need _platform to be a valid tenant_id everywhere — full _platform support is a follow-up scope per the warning in §4.2 |
Out of Scope (v1)
- Real-time blocking validation. Eventarc is asynchronous; if you need to prevent an action, use a Cloud Function or a synchronous gateway, not this pipeline.
- Bidirectional sync. This pipeline is GCP → Kafka only. Travila → GCP writes go through service-specific code paths (e.g., the `services/storage` GCS client wrapper).
- Non-GCP event sources. Inbound webhooks from Stripe, GitHub, Twilio, Pipedream go through the existing webhook gateway and the Edge Idempotency Key PRD flow, not this pipeline.
- Firestore document events. No active consumer day 1. `device_tokens`, `user_settings`, `shared_plans` all have non-event ownership stories. Eventarc supports Firestore natively if a future need appears — one-line addition; the design doesn't need to plumb it in advance.
- `_platform`-scoped tenancy end-to-end. Recognized only in the Edge Idempotency PRD's namespacing today; `pkg/tenancy` validators, Firestore security rules, and the OpenFGA model don't whitelist a `_`-prefixed ID. Lands with the first `_platform`-scoped consumer's PRD (DevOps or Security agent).
- Replicating provider strengths. Not aiming to replicate Firestore's change-stream UX, Pub/Sub filtering, or Kafka stream processing inside this pipeline. The pipeline is the dumb pipe; intelligence lives in the consumer.
- Unified DLQ across both surfaces. The two-surface model is intentional. If operational need eventually surfaces, a small connector reading the Pub/Sub DLQ into the Kafka DLQ is the way; not in scope day 1.
Cross-References
- GCP Eventarc → Kafka Pipeline PRD — the rewritten draft in full, including the appendix detailing what changed from the 2026-01-27 version
- Remove Manager Actor Pattern PRD — the day-1 consumer's home; consumer-side tenant isolation NFR is shared
- Edge Idempotency Roadmap — sibling roadmap; this pipeline is the first active Handler-Internal caller
- Tenant & Project Lifecycle Roadmap — the multi-tenant substrate (`tenants/{tid}/projects/{pid}/...`) the path convention is built on
- RequestContext Refactor Roadmap — typed-subject and protovalidate guarantees that downstream consumers of this pipeline rely on
- Eventarc event reference — the source-of-truth list of CloudEvent types
- CloudEvents specification — `ce-type`, `ce-subject`, `ce-id` semantics
- Kafka Connect Pub/Sub Source connector — the connector this pipeline uses