GCP Eventarc → Kafka Pipeline Roadmap

End-to-end view of how GCP-emitted events become a first-class part of the platform's async substrate — one ingestion pipe (Eventarc → Pub/Sub → Kafka Connect → Kafka), one shared envelope, multi-tenant routing built into the headers, idempotency built into the consumer. This page consolidates the GCP Eventarc → Kafka Pipeline PRD (Draft, 2026-04-24) and its dependencies into a single reference. The design is complete; the 3-phase rollout has not started.

Source PRDs

This page is derived from the GCP Eventarc → Kafka Pipeline PRD and its closely related dependencies:

Primary (docs/prd/):

  • GCP Eventarc → Kafka Pipeline — the rewritten draft (supersedes the 2026-01-27 gcp-events-pipeline.md); single-topic ingest, multi-tenant envelope, Handler-Internal idempotency, two-surface DLQ

Dependencies:

  • Remove Manager Actor Pattern — replaces StorageManagerActor with stateless StorageService; the day-1 consumer of this pipeline. NFR-5 (consumer-side tenant isolation) is reused here verbatim.
  • Projects — establishes the tenants/{tid}/projects/{pid}/... path convention every event must carry as tenant_id + project_id headers
  • EndUserService (Identity Capture) — explains why Firestore user_settings events are out of scope (no consumer; EndUserService owns user state directly)
  • Identity Platform — Zitadel — explains why Firebase Auth events are out of scope (Zitadel emits its own lifecycle events)

Related:

  • Edge Idempotency Key — the consumer side reuses the Handler-Internal pattern. This pipeline is the first active caller of that pattern (see Idempotency Roadmap)
  • Scheduled Jobs — explains why Cloud Scheduler events are out of scope (scheduler dispatches are synchronous HTTP, not async Kafka)
  • Notification Gateway — why device_tokens Firestore events are not plumbed (NotificationService registers tokens directly with Novu)
  • Firestore ADR — projection-store context behind the source-of-truth split

Architectural Direction — One Pipe, Many Sources

The design deliberately avoids per-source infrastructure. Every GCP source the platform ever cares about lands on the same Pub/Sub topic, the same Kafka Connect connector, the same Kafka topic — and consumers discriminate on the CloudEvents ce-type header in-process:

  • Single shared topic. gcp.v1.events.event carries every GCP event. Topic name is auto-derived by protoc-gen-kafka from the FQN of the envelope message (gcp.v1.events.Event). 12 partitions, 7-day retention — sized for ≤ 50 events/sec at day 1 with ~6× headroom for the future-sources catalog.
  • CloudEvents-native filtering. ce-type (google.cloud.storage.object.v1.finalized, google.cloud.cloudbuild.build.v1.statusCompleted, etc.) is preserved as a Kafka header by the Pub/Sub Source connector. Consumers subscribe to the topic and switch on the type. No ksqlDB, no Kafka Streams, no enrichment job.
  • Multi-tenant routing in the envelope, not the body. A small Single Message Transform on the connector parses the tenants/{tid}/projects/{pid}/... prefix from ce-subject and emits x-tenant-id + x-project-id Kafka headers. Consumers MUST treat these headers as authoritative — never re-derive tenancy from the payload.
  • Adding a source is a fixed recipe. New Eventarc trigger + new typed payload proto + new case in the consumer's switch. No new Pub/Sub topic, no new Kafka topic, no new connector instance, no new partitions. The recipe fits in a single PR with no infrastructure approval.

Canonical reference: GCP Eventarc → Kafka Pipeline PRD.

Glossary

  • Eventarc trigger — One per GCP source instance (one bucket, one Firestore database, one IAM scope). Subscribes to a CloudEvent stream and publishes to the shared gcp-events Pub/Sub topic.
  • Pub/Sub topic gcp-events — Single shared durable buffer for all GCP events. One topic, one subscription, one DLQ — provisioned once, never grows with the source count.
  • Kafka topic gcp.v1.events.event — Single shared Kafka topic carrying every GCP event. Name derived by protoc-gen-kafka from the envelope FQN. Partition key = ce-subject.
  • Envelope — The gcp.v1.events.Event proto: CloudEvents 1.0 attributes plus the platform's required tenant_id / project_id and a google.protobuf.Any payload. The only message in the package with the (kafka.options.v1.topic) annotation.
  • Typed payload — A per-source proto (e.g., GCSObjectFinalized, GCSObjectDeleted, future CloudBuildCompleted) carried inside Event.payload as Any. No oneof; Any keeps the envelope stable as new sources are added.
  • ce-type header — CloudEvents type attribute (google.cloud.storage.object.v1.finalized). The canonical discriminator — consumers switch on it.
  • x-tenant-id / x-project-id headers — Platform-required Kafka headers populated by the connector's SMT from the tenants/{tid}/projects/{pid}/... prefix in ce-subject. Authoritative for routing — handlers MUST NOT re-derive from payload.
  • _platform sentinel — Reserved value of tenant_id / project_id for organization-wide events (IAM audit logs, billing notifications, GKE control-plane). The underscore prefix cannot collide with real tenant/project IDs.
  • Two-surface DLQ — Independent failure modes with separate dead-letter topics: Surface A = Pub/Sub-side delivery failure (gcp-events-dlq Pub/Sub topic), Surface B = connector/consumer-side processing failure (gcp.v1.events.event-dlq Kafka topic). Not automatically mirrored — both must be monitored.
  • Handler-Internal dedup — The Edge Idempotency PRD pattern where the consumer derives a dedup token from a trusted server-side signal. Here that signal is ce.id — sender-generated by GCP, unique per delivery, stable across retries. Passed to restate.WithIdempotencyKey("gcp-event:" + event.Id) on every effecting downstream call.
  • last_event_id — Belt-and-braces dedup field on the file Firestore document. The upsert is a transactional read-then-write that no-ops if last_event_id == event.id. Catches Pub/Sub redeliveries that arrive after Restate's idempotency cache has expired.

What This Replaces

The pipeline rewrite is driven by three platform changes since the original 2026-01-27 PRD landed.

  • Manager actors are being removed (Remove Manager Actor PRD, 2026-04-22): StorageManagerActor → stateless StorageService backed by Firestore. Effect: the original storage-event consumer, written against the actor's EXCLUSIVE handlers and Restate K/V state, no longer applies; the new consumer is a stateless service handler — concurrent uploads to the same user run in parallel.
  • Projects exist (Projects PRD, April 2026): tenant data is project-scoped under tenants/{tid}/projects/{pid}/.... Effect: every event must carry both tenant_id and project_id; the path-prefix convention is canonical for path-based subjects (Cloud Storage object names, Firestore document paths, BigQuery table IDs).
  • End-user identity is owned by EndUserService (EndUserService PRD, April 2026): SocayoUserActor is gone. Effect: Firestore user_settings events flowing through Kafka would land nowhere; removed from scope — EndUserService owns user state, Firestore is the source of truth, no event round-trip needed.

Three categories of events appeared in the 2026-04-10 draft and are out of scope here:

  • Firebase Auth events — Identity Platform — Zitadel: Zitadel emits its own lifecycle events; tenant member auth no longer routes through Firebase.
  • Cloud Scheduler events — Scheduled Jobs PRD: scheduler dispatches are synchronous HTTP through KrakenD, not async Kafka.
  • Firestore document events — no active consumer. device_tokens is owned by NotificationService (registered with Novu directly). user_settings is owned by EndUserService (Firestore is source of truth, written from the gateway upsert). shared_plans has no event-driven consumer planned. Eventarc supports Firestore natively if a future need appears — one-line addition, no need to plumb in advance.

The Pipeline

(Diagram) Pipeline architecture: GCP sources (Cloud Storage day 1; Cloud Build and IAM in the future) fan into Eventarc triggers; triggers publish to the single Pub/Sub topic gcp-events; the Kafka Connect Pub/Sub Source connector's SMT parses tenants/{tid}/projects/{pid}/... from ce-subject into x-tenant-id and x-project-id headers; events land on the single shared Kafka topic gcp.v1.events.event (12 partitions, 7-day retention); consumers (StorageService day 1; DevOpsAgent and SecurityAgent in the future) filter by the ce-type header. The Pub/Sub DLQ (Surface A) and Kafka DLQ (Surface B) sit to the side.

The narrow waist is the single shared topic. Adding a new GCP source is a fixed recipe: declare an Eventarc trigger, add a typed payload proto, add a case to the consumer's switch. No new Pub/Sub topic, no new Kafka topic, no new connector instance, no new partitions.

Why Pub/Sub Between Eventarc and Kafka Connect

Eventarc supports Pub/Sub destinations natively, and Pub/Sub gives the pipeline three properties for free:

  • Backpressure handling. Ack-deadline (60s), retry policy, DLQ — provided by Pub/Sub without writing code.
  • Connector outage tolerance. Kafka Connect can be redeployed, upgraded, or sit behind a broker outage without losing events. They accumulate in the subscription.
  • Replay. If the Kafka topic is rebuilt or repartitioned, Kafka Connect resumes from a fresh subscription cursor without involving the GCP source.

Why a Single Kafka Topic

  • Provisioning cost: single shared topic — 1 topic, 1 subscription, 1 connector, provisioned once; per-type topics — N topics, N connectors, growing with the source count.
  • Per-source PR scope: single shared topic — 1 trigger + 1 proto + 1 case; per-type topics — plus topic config, connector config, and a Terraform module.
  • Filtering: single shared topic — in-process switch on the ce-type header; per-type topics — native, per-topic.
  • Stream processing required: none in either model.
  • Splittable later: single shared topic — yes, any one source can be promoted to its own topic without touching the others, envelope unchanged; per-type topics — N/A.

The single-topic decision survives the rewrite. CloudEvents ce-type filtering is the canonical discriminator pattern for the spec; the cost of the in-process switch is negligible at platform-event volumes (storage finalize/delete dominates day 1; future agent sources are even lower volume).

Multi-Tenant Addressing

Every event carries both tenant_id and project_id. This is a hard requirement: the consumer side has no fallback path, and an event missing these fields is routed to the DLQ.

Path-Prefix Convention

For sources whose subject is a path (Cloud Storage object names, Firestore document paths, BigQuery table IDs), the prefix is canonical:

tenants/{tenant_id}/projects/{project_id}/{source-specific-suffix}
  • Cloud Storage — tenants/acme/projects/healthcoach-prod/users/u_42/profile.jpg
  • Cloud Storage — tenants/acme/projects/fitness-prod/conversations/c_99/audio.webm
  • Firestore (future) — tenants/acme/projects/healthcoach-prod/end_users/u_42/sessions/s_1

pkg/tenancy Path Helpers

Two helpers are added to pkg/tenancy alongside the existing actor-key helpers (BuildActorKey / ParseActorKey):

// PathFor returns "tenants/{tenant_id}/projects/{project_id}/{suffix}".
// Validates that tenant_id and project_id are non-empty and not the
// reserved "_platform" sentinel.
func PathFor(tenantID, projectID, suffix string) string

// ParsePath parses a path produced by PathFor (or written by uploaders
// that follow the convention) and returns its components, or an error
// if the prefix is malformed.
func ParsePath(path string) (tenantID, projectID, suffix string, err error)

Centralizing prefix construction keeps consumers from drifting and makes a future refactor (e.g., adding a region segment) a one-diff change.
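A minimal sketch of what the two helpers could look like, assuming the signatures above. The real validation rules (and the exact _platform handling) live in pkg/tenancy; the constant name and the panic on invalid input are simplifications for illustration.

package tenancy

import (
    "fmt"
    "strings"
)

const platformSentinel = "_platform" // assumed constant name, for illustration only

// PathFor returns "tenants/{tenant_id}/projects/{project_id}/{suffix}".
// In this sketch an invalid tenant/project panics; the real helper may
// report validation failures differently.
func PathFor(tenantID, projectID, suffix string) string {
    if tenantID == "" || projectID == "" ||
        tenantID == platformSentinel || projectID == platformSentinel {
        panic(fmt.Sprintf("tenancy: invalid tenant %q / project %q", tenantID, projectID))
    }
    return fmt.Sprintf("tenants/%s/projects/%s/%s", tenantID, projectID, suffix)
}

// ParsePath splits "tenants/{tid}/projects/{pid}/{suffix}" back into its
// components, returning an error when the prefix is malformed.
func ParsePath(path string) (tenantID, projectID, suffix string, err error) {
    parts := strings.SplitN(path, "/", 5)
    if len(parts) < 5 || parts[0] != "tenants" || parts[2] != "projects" ||
        parts[1] == "" || parts[3] == "" || parts[4] == "" {
        return "", "", "", fmt.Errorf("tenancy: malformed path prefix %q", path)
    }
    return parts[1], parts[3], parts[4], nil
}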

The _platform Sentinel

Some GCP events are organization-wide and don't belong to a tenant — IAM audit logs, billing notifications, project-level GKE control-plane events. These flow through the pipeline with tenant_id = "_platform", project_id = "_platform". Reserved value, rejected as a real ID elsewhere in the platform. Tenant-scoped consumers ignore them naturally.

Follow-up scope

Today, _platform is recognized only inside the Edge Idempotency Key PRD's key-namespacing layer. Treating it as a valid tenant_id / project_id value end-to-end requires updates to pkg/tenancy validators, Firestore security rules, and the OpenFGA model — none of which currently whitelist a _-prefixed ID. Out of scope for this PRD; will land with the first _platform-scoped consumer (DevOps or Security agent) under its own PRD. No day-1 source emits _platform-scoped events, so this caveat does not block the day-1 rollout.

Header Propagation

Eventarc → Pub/Sub uses CloudEvents natively. The Pub/Sub Source connector preserves CloudEvent attributes as Kafka record headers (ce-id, ce-type, ce-source, ce-subject, ce-time). On top of those, a small Single Message Transform — living in integrations/kafka-connect/transforms/, unit-tested independently — extracts tenant_id / project_id from the subject's path prefix and emits them as x-tenant-id / x-project-id headers.

For sources whose subject is not a path (e.g., Eventarc resources keyed by project number or bucket name), the trigger's filter restricts it to one (tenant, project) at trigger creation time, and the headers are populated from trigger metadata via a small Cloud Function adapter sitting between Eventarc and Pub/Sub. This is the rare case, not the default.

Tenant Isolation at the Consumer

Consumers MUST treat the x-tenant-id and x-project-id headers as authoritative. They MUST NOT re-derive tenancy from event payload fields. This matches Remove Manager Actor Pattern PRD's NFR-5 on consumer-side tenant isolation: the source of truth for routing is the envelope, not the body.
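A sketch of what the consumer-side cross-check can look like, assuming the ParsePath helper above and the Restate Go SDK's TerminalError wrapper (the exact error-wrapping call is an assumption):

// checkEnvelopeMatchesPath keeps the envelope authoritative: the path inside
// the payload is parsed only to detect a discrepancy, never to reroute.
func checkEnvelopeMatchesPath(event *gcpeventsv1.Event, objectPath string) error {
    tenant, project, _, err := tenancy.ParsePath(objectPath)
    if err != nil {
        return restate.TerminalError(fmt.Errorf("unparseable subject path: %w", err))
    }
    if tenant != event.GetTenantId() || project != event.GetProjectId() {
        return restate.TerminalError(fmt.Errorf(
            "tenancy mismatch: envelope %s/%s vs path %s/%s",
            event.GetTenantId(), event.GetProjectId(), tenant, project))
    }
    return nil
}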

Event Schema

Envelope

syntax = "proto3";

package gcp.v1.events;

import "google/protobuf/any.proto";
import "google/protobuf/timestamp.proto";
import "kafka/options/v1/options.proto";

option go_package = "github.com/bo-socayo/workflows-v7/apis/gcp/v1/events;gcpeventsv1";

// Event is the Kafka envelope for every event ingested through the
// GCP Eventarc pipeline. CloudEvents 1.0 plus tenant_id and project_id.
message Event {
  option (kafka.options.v1.topic) = true;

  // CloudEvents 1.0 attributes
  string spec_version = 1;              // "1.0"
  string id = 2;                        // Sender-generated unique ID; used for handler-internal dedup
  string type = 3;                      // "google.cloud.storage.object.v1.finalized"
  string source = 4;
  string subject = 5;
  google.protobuf.Timestamp time = 6;

  // Platform-required routing fields, populated by the connector
  string tenant_id = 10;                // From subject path or trigger metadata; "_platform" for org-wide events
  string project_id = 11;               // Same provenance as tenant_id

  // Source-specific typed payload — see Payloads
  google.protobuf.Any payload = 20;
}

Why Any and not oneof. Platform convention avoids oneof in new protos (same precedent as ContentPart). Any keeps the envelope stable as new sources are added — adding a new payload type does not change Event. The trade-off (consumers must payload.UnmarshalTo(&typedMsg)) is acceptable because consumers already discriminate on type first.
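For illustration, the decode step that trade-off refers to is roughly one call on the Any field. A sketch, assuming gcpeventsv1 is the generated package named in the go_package option above and that a mismatch should be terminal:

// decodeFinalized unwraps Event.payload into the matching typed message.
// A type mismatch is surfaced as an error so the record can go to the DLQ.
func decodeFinalized(event *gcpeventsv1.Event) (*gcpeventsv1.GCSObjectFinalized, error) {
    var payload gcpeventsv1.GCSObjectFinalized
    if err := event.GetPayload().UnmarshalTo(&payload); err != nil {
        return nil, fmt.Errorf("payload does not match ce-type %q: %w", event.GetType(), err)
    }
    return &payload, nil
}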

Day-1 Payloads (Cloud Storage)

// apis/gcp/v1/events/storage.proto

message GCSObjectFinalized {
  string bucket = 1;
  string object_path = 2;                    // Full path inside the bucket
  string content_type = 3;
  int64 size_bytes = 4;
  string md5_hash = 5;
  google.protobuf.Timestamp created_at = 6;
  map<string, string> object_metadata = 7;   // GCS object metadata (custom-headers)
}

message GCSObjectDeleted {
  string bucket = 1;
  string object_path = 2;
  google.protobuf.Timestamp deleted_at = 3;
}

No Per-Source Kafka Topics

A previous draft suggested per-typed-payload topic annotations (storage.v1.object-created, etc.). Removed. All payloads ride on the single gcp.v1.events.event topic; typed protos exist only as Any-unwrapped structures consumers decode after dispatch.

Component Inventory

New / Extended Components

  • Pub/Sub topic gcp-events — Single shared durable buffer for all GCP events; one subscription gcp-events-kafka-connect consumed by the connector (§3.1)
  • Pub/Sub topic gcp-events-dlq — Surface-A DLQ: Kafka Connect can't pull or ack (§8.4)
  • Kafka Connect Pub/Sub Source connector — Pulls from Pub/Sub, writes to gcp.v1.events.event. Preserves CloudEvent attributes as Kafka headers; SMT extracts tenant/project (§3.1, §11.2)
  • Kafka topic gcp.v1.events.event — Single shared topic, 12 partitions, 7-day retention. Name auto-derived by protoc-gen-kafka from the envelope FQN (§11.1)
  • Kafka topic gcp.v1.events.event-dlq — Surface-B DLQ: connector or consumer can't process (§8.4, §11.1)
  • apis/gcp/v1/events/event.proto — The shared Event envelope; the (kafka.options.v1.topic) annotation is here, on the envelope only (§5.1)
  • apis/gcp/v1/events/storage.proto — Day-1 typed payloads (GCSObjectFinalized, GCSObjectDeleted) (§5.2)
  • pkg/tenancy.PathFor / ParsePath — Centralized path-prefix construction and parsing for tenants/{tid}/projects/{pid}/... (§4.1)
  • integrations/kafka-connect/transforms/ — Single Message Transform that parses tenants/{}/projects/{} from ce-subject into x-tenant-id / x-project-id headers (§4.3, §11.2)
  • StorageService.ReceiveEvent (subscribed handler) — Day-1 consumer; filters for storage event types and dispatches to typed handlers (§7)
  • Eventarc trigger Terraform module (infrastructure/terraform/modules/gcp-events-trigger-gcs/) — Reusable per-bucket trigger module (§12)
  • Pipeline Terraform module (infrastructure/terraform/modules/gcp-events-pipeline/) — Pipe-level resources (Pub/Sub, subscription, DLQ, connector, IAM), instantiated once per env (§12)

Integrated Components (No Changes Required)

  • Restate Go SDK — restate.WithIdempotencyKey(...) is the dedup primitive for Handler-Internal calls. No SDK changes.
  • pkg/tenancy (existing) — Already carries BuildActorKey / ParseActorKey; the path helpers fit alongside them.
  • protoc-gen-kafka — Auto-derives the Kafka topic name from the envelope's lowercased proto FQN. The envelope is the only message in gcp.v1.events with the topic annotation — no changes to the plugin.
  • pkg/telemetry — Existing audit/observability path. The original draft's bespoke AuditService consumer is removed — telemetry uses the standard integration.
  • Lago metering — storage_gb usage events flow through the existing Lago path. The envelope's tenant_id is the dimension key.

Observability Additions

Metrics map to the two failure surfaces in §8.4. Each surface has its own alert.

  • pubsub_subscription_num_undelivered_messages (subscription = gcp-events-kafka-connect) — Pub/Sub, Surface A; alert when > 1000 for 5 min
  • pubsub_subscription_oldest_unacked_message_age — Pub/Sub, Surface A; alert when > 300s
  • pubsub_dlq_message_count (custom, on the gcp-events-dlq Pub/Sub topic) — Pub/Sub, Surface A; alert when > 0 for 15 min
  • kafka_connect_source_record_poll_total — Kafka Connect; observed only, no alert
  • kafka_consumer_group_lag (group = storage-service) — Kafka, Surface B; alert when > 1000 for 5 min
  • gcp_events_dlq_message_count (custom, on the gcp.v1.events.event-dlq Kafka topic) — Kafka exporter, Surface B; alert when > 0 for 15 min

Dashboards: GCP Events Pipeline (Pub/Sub backlog, connector record rate, consumer-group lag), GCP Events by Type (record count + DLQ rate per ce-type), GCP Events by Tenant/Project (record count by x-tenant-id × x-project-id — useful when a tenant onboards or a project's traffic spikes).

Day-1 Source: Cloud Storage

Multi-Bucket From Day One

The pipeline is bucket-agnostic. Day 1 uses one bucket (travila-uploads), but the architecture supports any number without code changes:

  • The Eventarc trigger is a parameterized Terraform module — adding bucket #2 is a one-call instantiation.
  • StorageService reads the bucket name from GCSObjectFinalized.bucket, never from a constant.
  • Path conventions inside a bucket follow §6.3 by default; a bucket can opt out (e.g., a future travila-public-assets bucket holding tenant-agnostic platform assets).

Bucket Registry

  • travila-uploads (day 1) — All end-user file uploads (profile photos, audio, documents, exports); path convention tenants/{tid}/projects/{pid}/users/{uid}/{file_id}.{ext}
  • travila-generated-content (future) — AI-generated images and reports; same path convention as travila-uploads
  • travila-public-assets (future) — Platform-wide static assets (no tenant scope); path convention assets/{asset_id}.{ext}

Adding a new bucket: Terraform module instantiation + an entry in this table + (if path convention differs) consumer-side handling.

Event Types

  • Object finalized — google.cloud.storage.object.v1.finalized — upload complete (initial PUT or compose finished)
  • Object deleted — google.cloud.storage.object.v1.deleted — object removed via API or lifecycle rule

object.metadataUpdated is excluded — the platform does not mutate object metadata after upload.

Sample Event (After Connector Processing)

Kafka record on gcp.v1.events.event
Headers:
ce-id: 13284820103245192
ce-type: google.cloud.storage.object.v1.finalized
ce-source: //storage.googleapis.com/projects/_/buckets/travila-uploads
ce-subject: objects/tenants/acme/projects/hcprod/users/u_42/file_8a91.jpg
ce-time: 2026-04-24T12:00:00.000Z
x-tenant-id: acme
x-project-id: hcprod
Body:
Event {
  id: "13284820103245192"
  type: "google.cloud.storage.object.v1.finalized"
  tenant_id: "acme"
  project_id: "hcprod"
  payload: Any { GCSObjectFinalized { bucket, object_path, … } }
}

Day-1 Consumer: StorageService

StorageService (the stateless Restate service from Remove Manager Actor Pattern PRD §FR-1.2) gets one new Kafka-subscribed handler.

service StorageService {
  // Existing CRUD handlers unchanged.

  // Reconciles a GCS event against Firestore + Lago metering.
  // Subscribed to gcp.v1.events.event; filters for storage event types.
  rpc ReceiveEvent(gcp.v1.events.Event) returns (google.protobuf.Empty) {
    option (kafka.options.v1.subscribe) = true;
  }
}

No EXCLUSIVE handler. StorageService is stateless; concurrent uploads to the same user run in parallel, bounded only by GCS itself.

ReceiveEvent(event)
1. Validate envelope: event.tenant_id and event.project_id must be set
(else TerminalError → DLQ via Restate's default policy)
2. Switch on event.type:
google.cloud.storage.object.v1.finalized → handleObjectFinalized
google.cloud.storage.object.v1.deleted → handleObjectDeleted
(default — not a storage event) → return nil (ignore)

handleObjectFinalized(event, payload)
1. ParsePath(payload.object_path) → (tenant, project, suffix)
Reject (DLQ) on parse failure.
2. Cross-check parsed tenant/project against envelope tenant_id/project_id.
Discrepancy → TerminalError → DLQ.
3. Upsert Firestore file doc at tenants/{tid}/projects/{pid}/users/{uid}/files/{file_id}
with size_bytes, content_type, md5_hash, storage_ref, last_event_id = event.id.
Pass restate.WithIdempotencyKey("gcp-event:" + event.Id) on the call.
4. Emit storage_gb usage event to Kafka via the existing Lago metering path.
Same idempotency key.

handleObjectDeleted(event, payload)
1. ParsePath as above.
2. Delete Firestore file doc (idempotent — already-absent is fine).
3. Emit storage_gb decrement usage event.
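In Go, the dispatch half of that pseudocode is envelope validation plus a switch on event.type. A minimal sketch, assuming the Restate Go SDK handler shape (restate.Context, restate.TerminalError) and the decodeFinalized helper shown in the schema section:

// ReceiveEvent is the Kafka-subscribed handler. Sketch only; wiring and
// error handling follow the Restate Go SDK conventions assumed here.
func (s *StorageService) ReceiveEvent(ctx restate.Context, event *gcpeventsv1.Event) error {
    // 1. Envelope validation: tenancy fields are mandatory. A missing field is
    //    terminal, so Restate's default policy routes the record to the DLQ.
    if event.GetTenantId() == "" || event.GetProjectId() == "" {
        return restate.TerminalError(fmt.Errorf("event %s missing tenant_id/project_id", event.GetId()))
    }

    // 2. Discriminate on the CloudEvents type; non-storage events are ignored.
    switch event.GetType() {
    case "google.cloud.storage.object.v1.finalized":
        payload, err := decodeFinalized(event)
        if err != nil {
            return restate.TerminalError(err)
        }
        return s.handleObjectFinalized(ctx, event, payload)

    case "google.cloud.storage.object.v1.deleted":
        var payload gcpeventsv1.GCSObjectDeleted
        if err := event.GetPayload().UnmarshalTo(&payload); err != nil {
            return restate.TerminalError(err)
        }
        return s.handleObjectDeleted(ctx, event, &payload)

    default:
        return nil // not a storage event; another consumer may care
    }
}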

What the Old Manager Actor Did That This Doesn't

  • Wrote into the actor's K/V file tree → writes a Firestore document; the tree is queried via Firestore subcollection reads.
  • EXCLUSIVE lock per user serialized concurrent uploads → no lock; concurrent uploads run in parallel.
  • Quota enforcement via in-actor counter → quota enforcement via Lago + Redis (Remove Manager Actor Pattern §FR-1.3).
  • Thumbnail generation inline → out of scope here; if added later, a separate handler subscribes to the same event and is observed independently.

Idempotency: Handler-Internal via ce.id

Pub/Sub is at-least-once. The Pub/Sub Source connector inherits that. Consumers will see the same event more than once during retries, broker rebalances, or connector restarts. The platform's Edge Idempotency Key PRD addresses this for HTTP edges; Kafka consumers need the equivalent.

The Pattern

Per the Edge Idempotency PRD's Handler-Internal pattern, the consumer derives a dedup token from a trusted server-side signal. CloudEvents ce.id is exactly that — sender-generated by GCP, unique per delivery, stable across retries. The consumer passes it to restate.WithIdempotencyKey(...) on every downstream invocation that has effects we care about deduping:

// Inside StorageService.handleObjectFinalized:
fileDoc := buildFileDoc(event, payload)

_, err := s.firestoreClient.UpsertFile(
    ctx,
    fileDoc,
    restate.WithIdempotencyKey("gcp-event:"+event.Id),
)
if err != nil {
    return err
}

return s.lagoMetering.RecordStorageUsage(
    ctx,
    event.TenantId,
    payload.SizeBytes,
    restate.WithIdempotencyKey("gcp-event:"+event.Id),
)

The gcp-event: prefix namespaces the key against unrelated callers (per Edge Idempotency PRD §Cache Isolation). Restate's native dedup mechanism handles the rest.

Cross-PRD State

The Edge Idempotency Key PRD v1.1 currently states that Handler-Internal has no active callers — Asynq scheduler became Client-Generated, leaving the pattern dormant as a template. The StorageService event consumer specified here is the first active caller. When this PRD ships, the Edge Idempotency PRD's "no active callers" caveat needs to be updated to point at this consumer.

Belt and Braces: last_event_id on the File Document

In addition to per-call WithIdempotencyKey, the file Firestore document carries last_event_id. The upsert is a transactional read-then-write that no-ops if last_event_id == event.id already. This catches the case where Restate's idempotency cache has expired (default 24h) but a Pub/Sub redelivery still arrives.
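A sketch of that transactional read-then-write using the cloud.google.com/go/firestore client and gRPC status codes. The document path, field names, and function name are illustrative; the real upsert lives behind StorageService's Firestore client.

// upsertFileDoc writes the file document only when this event has not been
// applied yet; a redelivery after Restate's idempotency cache expired no-ops.
func upsertFileDoc(ctx context.Context, fs *firestore.Client, docPath string,
    eventID string, fields map[string]interface{}) error {
    ref := fs.Doc(docPath)
    return fs.RunTransaction(ctx, func(ctx context.Context, tx *firestore.Transaction) error {
        snap, err := tx.Get(ref)
        if err != nil && status.Code(err) != codes.NotFound {
            return err
        }
        if snap != nil && snap.Exists() {
            if last, _ := snap.DataAt("last_event_id"); last == eventID {
                return nil // already applied; no-op
            }
        }
        fields["last_event_id"] = eventID
        return tx.Set(ref, fields, firestore.MergeAll)
    })
}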

Out-of-Order Events

Pub/Sub does not guarantee order, but the Storage event volume per object is low (one finalized + at most one deleted). The consumer treats both as commutative: deleted-then-finalized leaves the document present; finalized-then-deleted leaves it absent. If a future source needs ordering (e.g., Cloud Build with multiple status updates), it adopts a per-resource sequence number in its typed payload and the consumer compares against last_seq on the target document.
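If such a source arrives, the guard is a few lines layered on the same transactional read. A sketch; last_seq and the incoming sequence value are hypothetical names for a future ordered source.

// shouldApply reports whether an incoming event is newer than what the
// target document already recorded. Older or duplicate sequences are dropped.
func shouldApply(snap *firestore.DocumentSnapshot, incomingSeq int64) bool {
    if snap == nil || !snap.Exists() {
        return true // nothing recorded yet
    }
    raw, err := snap.DataAt("last_seq")
    if err != nil {
        return true // field not present yet
    }
    last, ok := raw.(int64)
    return !ok || incomingSeq > last
}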

Two-Surface DLQ

The pipeline has two independent failure surfaces, each with its own DLQ. They are not the same flow and are not automatically mirrored — oncall must monitor both.

(Diagram) Two-surface DLQ: the happy-path row runs Eventarc → Pub/Sub topic → Kafka Connect (with SMT) → Kafka topic → consumer. Surface A (Pub/Sub DLQ) branches off after the Pub/Sub topic when Kafka Connect can't pull or ack; Surface B (Kafka DLQ) branches off after Kafka Connect when the SMT or encoding fails before the message reaches Kafka. Detection metrics are listed beneath each DLQ, with a footer warning that both surfaces must be monitored independently with different tooling.

Surface A — Pub/Sub-side delivery
  • Where the failure happens: Kafka Connect is unreachable, stuck, or its credentials are revoked. Pub/Sub retries up to 5 times, then routes to a Pub/Sub DLQ.
  • DLQ topic: gcp-events-dlq (Pub/Sub, not Kafka)
  • Detection metrics: pubsub_subscription_num_undelivered_messages, pubsub_subscription_oldest_unacked_message_age, pubsub_dlq_message_count
  • Typical causes: connector outage, IAM credential rotation, Kafka broker partition

Surface B — Connector / consumer-side processing
  • Where the failure happens: Kafka Connect successfully pulled the message but failed to publish to Kafka — bad payload encoding, SMT failure parsing the prefix, or transient broker write error.
  • DLQ topic: gcp.v1.events.event-dlq (Kafka)
  • Detection metrics: gcp_events_dlq_message_count, kafka_consumer_group_lag
  • Typical causes: malformed subject path, missing tenant/project headers, payload type mismatch with ce-type

No automatic mirroring. A failure on Surface A is not visible on Surface B and vice versa. If a future operational need calls for one unified DLQ, the right move is a separate, narrowly-scoped Kafka Connect connector that reads gcp-events-dlq (Pub/Sub) and writes to gcp.v1.events.event-dlq (Kafka). Out of scope for day 1; the two-surface model is sufficient.

Adding a New GCP Source — The Recipe

This recipe is the whole point of the design. In order:

  1. Pick the CloudEvent type(s) the source emits. Look it up in the Eventarc event reference.
  2. Define a typed payload proto at apis/gcp/v1/events/{source_name}.proto. Mirror only the fields the consumer needs; don't replicate the entire CloudEvent body.
  3. Instantiate the Eventarc trigger module in Terraform with the source-specific filter (bucket, document path pattern, project, etc.). Triggers always target google_pubsub_topic.gcp_events.
  4. Add a case to the target consumer's ReceiveEvent switch matching the new ce.type. Decode event.Payload into the typed message; do the work.
  5. Document the source in the Future Sources Catalog, promoting it from "future" to "active" once the consumer ships.

There is no infrastructure step. No new Pub/Sub topic. No new Kafka topic. No new connector instance. No new partitions.
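For illustration, the consumer-side half of the recipe for a Cloud Build source (the Phase 3 skeleton) might look like this. DevOpsAgentService and CloudBuildCompleted are future names from the catalog below, shown only as a sketch of the shape of the change, not an implementation:

// Sketch of the future DevOpsAgentService consumer: same envelope, one new
// case in the switch, no new infrastructure.
func (s *DevOpsAgentService) ReceiveEvent(ctx restate.Context, event *gcpeventsv1.Event) error {
    switch event.GetType() {
    case "google.cloud.cloudbuild.build.v1.statusCompleted":
        var build gcpeventsv1.CloudBuildCompleted // hypothetical typed payload
        if err := event.GetPayload().UnmarshalTo(&build); err != nil {
            return restate.TerminalError(err)
        }
        slog.Info("cloud build completed",
            "tenant", event.GetTenantId(), "build", build.String())
        return nil
    default:
        return nil
    }
}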

Future Sources Catalog (DevOps Agents)

These are the agent-relevant sources we expect to add. None are in this PRD's implementation scope; they're listed so reviewers can sanity-check that the pipeline shape supports them. For each, the recipe above applies unchanged.

  • Cloud Build — google.cloud.cloudbuild.build.v1.{statusChanged,statusFailed,statusCompleted}; likely consumer DevOpsAgentService (future); tenant scope _platform. Build outcomes per repo / per branch; the trigger ID identifies the workflow.
  • Cloud Run — google.cloud.run.revision.v1.{ready,failed}; likely consumer DevOpsAgentService; tenant scope _platform. Revision rollout health — useful for "did the deploy succeed?" agent queries.
  • GKE control plane — google.cloud.audit.log.v1.written (filtered to container.googleapis.com); likely consumer DevOpsAgentService; tenant scope _platform. Cluster-level changes (node pool resize, version upgrades).
  • IAM audit log — google.cloud.audit.log.v1.written (filtered to IAM methods); likely consumer SecurityAgentService (future); tenant scope _platform. High-signal security trail — grants, key creates, policy changes.
  • Cloud Logging severity sink — google.cloud.audit.log.v1.written filtered to severity >= ERROR; likely consumer OpsAlertConsumer (future); tenant scope _platform, or per-tenant if log labels carry tenancy. Catch-all for unhandled errors that should reach an agent.

Security & Compliance

  • Encryption in transit — TLS for Pub/Sub; mTLS between Kafka Connect ↔ Kafka.
  • Encryption at rest — GCP default for Pub/Sub; Kafka volumes encrypted at the disk level.
  • Retention — 7 days on gcp.v1.events.event; 14 days on gcp.v1.events.event-dlq. Aligns with the platform privacy retention window.
  • Audit trail — Cloud Audit Logs on the Pub/Sub topic; Restate / Kafka logging via pkg/telemetry. No bespoke AuditService consumer — removed in favor of the standard telemetry path.
  • Tenant isolation — x-tenant-id / x-project-id headers are authoritative; consumers MUST NOT re-derive tenancy from the body. Protects against payload spoofing in the rare case where a misconfigured trigger would otherwise emit a path the consumer would interpret as a different tenant.

IAM

  • eventarc-gcp-events — roles/pubsub.publisher on gcp-events (Eventarc → Pub/Sub)
  • kafka-connect-gcp — roles/pubsub.subscriber on the gcp-events-kafka-connect subscription (Kafka Connect → Pub/Sub)

Rollout Phases

The 3-phase plan from the PRD. Each phase is mostly mechanical — Terraform module instantiation, proto + handler edits, observability config.

  1. Pipeline infrastructure (not started) — Deploy the gcp-events-pipeline Terraform module (Pub/Sub topic, subscription, DLQ, IAM, Kafka Connect connector). Verify a synthetic Pub/Sub publish appears in gcp.v1.events.event with headers preserved. Deploy monitoring dashboards and alerts.
  2. Cloud Storage source + StorageService consumer (not started) — Instantiate gcp-events-trigger-gcs for travila-uploads. Land StorageService.ReceiveEvent (subscribed handler + handleObjectFinalized / handleObjectDeleted). End-to-end test: upload → Firestore document + Lago metering event. Migrate any pre-existing storage workflows that previously called StorageManagerActor synchronously (covered by Remove Manager Actor Pattern PRD §FR-4, not duplicated here).
  3. First DevOps source, Cloud Build (not started) — Add apis/gcp/v1/events/cloud_build.proto. Instantiate the Eventarc trigger for Cloud Build status change events. Land DevOpsAgentService.ReceiveEvent (skeleton consumer that logs build outcomes; agent reasoning out of scope). Validate the recipe works end-to-end with no infra change. Promote Cloud Build from "future" to "active" in the Future Sources Catalog.

Dependency Ordering

(Diagram) Phase dependency graph: two external PRDs (Projects PRD, Remove Manager Actor Pattern PRD) appear as blockers. Phase 1 (pipeline infrastructure) depends on the Projects PRD; Phase 2 (Cloud Storage source + StorageService consumer) depends on both Phase 1 and the Remove Manager Actor Pattern PRD; Phase 3 (first DevOps source — Cloud Build) depends on Phase 2 and is marked as recipe validation rather than new functionality.

  • Phase 1 depends on the Projects PRD — the path convention tenants/{tid}/projects/{pid}/... is established by Projects; the SMT in §11.2 parses against that exact prefix.
  • Phase 2 depends on Phase 1 — the StorageService consumer needs the Kafka topic and connector running before it can subscribe.
  • Phase 2 depends on the Remove Manager Actor Pattern PRD — the day-1 consumer is the new stateless StorageService from that PRD; it doesn't exist as a destination until that PRD ships.
  • Phase 3 depends on Phase 2 — validates that the recipe is what's claimed: second source, no infra change.
  • Phase 3 is independent of _platform end-to-end support — Cloud Build is _platform-scoped, but the day-1 consumer's logging skeleton doesn't need _platform to be a valid tenant_id everywhere; full _platform support is follow-up scope per the warning in §4.2.

Out of Scope (v1)

  • Real-time blocking validation. Eventarc is asynchronous; if you need to prevent an action, use a Cloud Function or a synchronous gateway, not this pipeline.
  • Bidirectional sync. This pipeline is GCP → Kafka only. Travila → GCP writes go through service-specific code paths (e.g., the services/storage GCS client wrapper).
  • Non-GCP event sources. Inbound webhooks from Stripe, GitHub, Twilio, Pipedream go through the existing webhook gateway and the Edge Idempotency Key PRD flow, not this pipeline.
  • Firestore document events. No active consumer day 1. device_tokens, user_settings, shared_plans all have non-event ownership stories. Eventarc supports Firestore natively if a future need appears — one-line addition; the design doesn't need to plumb it in advance.
  • _platform-scoped tenancy end-to-end. Recognized only in the Edge Idempotency PRD's namespacing today; pkg/tenancy validators, Firestore security rules, and the OpenFGA model don't whitelist a _-prefixed ID. Lands with the first _platform-scoped consumer's PRD (DevOps or Security agent).
  • Replicating provider strengths. Not aiming to replicate Firestore's change-stream UX, Pub/Sub filtering, or Kafka stream processing inside this pipeline. The pipeline is the dumb pipe; intelligence lives in the consumer.
  • Unified DLQ across both surfaces. The two-surface model is intentional. If operational need eventually surfaces, a small connector reading the Pub/Sub DLQ into the Kafka DLQ is the way; not in scope day 1.

Cross-References