
Scheduled Jobs Roadmap

End-to-end view of how the platform builds a scheduler primitive — proactive LLM check-ins, medication reminders, digest emails, metering rollups, or any cron-driven work — on top of River and Cloud SQL Postgres. This page consolidates the Scheduled Jobs PRD (v3 — River + Cloud SQL, supersedes v2 Asynq+Firestore and v0.4 GCP Cloud Scheduler) and its dependencies into a single reference. The design is complete; the 5-phase rollout has not started.

Source PRDs

This page is derived from the v3 PRD and its closely related dependencies:

Primary (docs/prd/):

  • Scheduled Jobs — the v3 design (2026-04-30): River + Cloud SQL, same-transaction CreateJob, audit-only reconciliation, Restate ingress as Phase 1 trust boundary, 5-phase rollout (+ a deliberately-not-on-roadmap Phase 6 placeholder)

Dependencies:

  • Edge Idempotency Key — canonical Idempotency-Key semantics. The dispatcher is the first active Client-Generated caller of that pattern: every HTTP dispatch carries Idempotency-Key: sched:<schedule_id>:<scheduled_time_unix_millis> and Restate ingress dedupes natively
  • Authorization PRD — typed-subject format used by owner_subject and the audit columns
  • Projects PRD — tenant_id + project_id columns and Postgres RLS policies isolate schedules per project
  • Production Infrastructure PRD — Cloud SQL Enterprise Plus + regional HA + GCP Managed Connection Pooling are Phase 1 hard requirements

Related:

Architectural Direction — Postgres Atomicity Replaces a Reconciler

The whole design refuses to operate two storage systems for one logical thing. Schedules and the queue rows that drive them go into the same Postgres transaction — drift between schedule state and queue state becomes impossible by construction, and the periodic reconciler that v2 needed to band-aid Firestore↔Redis drift collapses into an audit-only sweeper that pages on any non-zero finding.

  • Cloud SQL is the source of truth for everything: schedule docs (scheduler.schedules), queue rows (river_job — owned by the River library), and execution history (scheduler.executions, range-partitioned monthly via pg_partman). One store, one durability story.
  • River is the execution engine — Postgres-backed Go job queue with native cron via PeriodicJob, leader-elected enqueueing via Postgres advisory lock, native lease + rescuer recovery, UniqueOpts for per-tick dedup. Replaces v2's self-re-enqueue dance (workarounds for asynq.Scheduler bugs) with library-native primitives.
  • HTTP dispatch via Restate ingress with the canonical IETF Idempotency-Key header — Restate's partition processor dedupes natively, so target gateways inherit at-most-once side effects without writing application-level middleware.
  • Kafka is a reactive bus, not a projection pipeline. Execution rows live in Postgres; the scheduler.executions Kafka topic exists only for downstream consumers (auto-pause, metering). If Kafka is unavailable, execution data is durable in PG and missed events can be replayed. Kafka is not on the durability path.

Canonical reference: Scheduled Jobs PRD v3. The v0.4 (GCP Cloud Scheduler) and v2 (Asynq + Firestore) designs are superseded and preserved only for git-history forensics.

Glossary

Term | Definition
SchedulerService | Restate service. CRUD over schedules. Writes scheduler.schedules and river_job rows in one Postgres transaction wrapped in restate.Run() for journaling.
River | Postgres-backed Go job queue (riverqueue.com). Provides LISTEN/NOTIFY + 1s poll dispatch, lease + rescuer recovery, native PeriodicJob for cron, UniqueOpts for per-tick dedup, and an Asynqmon-equivalent admin UI.
River Worker | Standalone Go K8s Deployment. One fleet per target_kind (notification, llm, metering, …) for queue isolation. Runs river.Client.Start() to claim jobs and dispatch HTTP.
Same-transaction guarantee | Schedule INSERT + river.InsertTx() share a single Postgres transaction at SERIALIZABLE isolation. Either both rows commit or neither does. Drift between schedule state and queue state is structurally impossible — the v2 reconciler is gone.
Audit-only reconciliation | Nightly sweeper that diffs scheduler.schedules against river_job looking for active schedules with no future job (or vice versa). Expected count: zero. Any finding pages on-call — drift means a bug we want to know about, not routine cleanup.
PeriodicJob | River's native cron primitive. The leader-elected periodic_job_enqueuer (singleton across the fleet via Postgres advisory lock) inserts a fresh river_job row at each cron tick. Per-tick identity is automatic — no *Task/TaskID collisions like asynq.
UniqueOpts.ByArgs | River feature that dedupes job inserts by hash of args. Used to make restate.Run() replays safe — a second InsertTx for the same (schedule_id, scheduled_time) is a no-op.
Idempotency-Key | The IETF canonical header. The dispatcher always sets Idempotency-Key: sched:<schedule_id>:<scheduled_time_unix_millis> — deterministic, restart-safe, distinct per cron tick. Restate ingress dedupes natively; target gateways inherit safe retries.
auto_pause_threshold | Per-schedule integer; range 0 (disabled) or 3..100 (default 10). After N consecutive failed dispatches, the auto-pause Kafka consumer calls PauseJob(reason="auto:consecutive_failures"). The reason lands in paused_reason so tenants distinguish auto-pause from manual pause.
Retention as load-bearing config | Restate ingress idempotency retention (24h) MUST be ≥ River MaxAttempts × max-backoff (~6h). A CI deploy-time check fails the deploy if either side changes. The runtime metric scheduler.dispatch.idempotency_window_safety alerts if retention silently shrinks.
_platform sentinel | Reserved value of tenant_id / project_id for org-wide schedules (none in v1). Out of scope until the first _platform-scoped schedule's PRD lands.

Why v3 — What Got Rewritten

Two previous designs made it to full PRDs, and both were deliberately superseded.

Version | Substrate | Why retired
v0.4 | GCP Cloud Scheduler | Hard 500-jobs-per-project ceiling. ~$1,000/mo at 10k schedules. Per-tenant project sharding would have meant building the abstraction we needed inside our application anyway.
v2 | Asynq + Firestore + Kafka projection | Two stores that could drift → load-bearing reconciler every 5 min. asynq.Scheduler reused the same *Task/TaskID across ticks (collisions from tick #2 onward). Self-re-enqueue dance worked around the bug but was a canary for "fighting the library." Scheduler restart lost in-memory cron registrations. Redis without AOF = at-most-once.
v3 (current) | River + Cloud SQL | Native PeriodicJob (no self-re-enqueue). Same Postgres transaction for schedule + queue (no reconciler). PG durability inherited from Cloud SQL EP + HA. LISTEN/NOTIFY wakes workers in ~ms. Tens of thousands of writes/sec headroom on a single instance.

The cost shape changed too: ~$300/mo at 10k schedules for Cloud SQL EP + HA + Managed Pooling, CTO-approved as a line item. Phase 1 starts on the smallest EP class and scales reactively based on monitoring — no headroom pre-paid for hypothetical future domains.

Cloud SQL Is Pragmatic, Not a Migration Commitment

This Cloud SQL instance is provisioned because River requires Postgres. The decision is scheduler-scoped — the PRD does not commit to migrating other domains away from Firestore. If/when a second domain wants Postgres, instance topology (shared vs. per-domain) is a fresh decision at that time. The schedule schema is namespaced under scheduler.* so the instance can be repurposed cleanly if needed, but no part of the design depends on that.

Ownership: platform team owns the instance end-to-end (provisioning, HA topology, Managed Pooling config, backup verification, restore tests, on-call, capacity). Scheduler team owns schema definitions, query patterns, RLS policies, and application-level tuning.

Architecture

The system has two paths sharing one Postgres instance: a synchronous CRUD path (Client → KrakenD → SchedulerService → PG) and an asynchronous execution path (River → Worker → Restate → Target). Kafka sits to the side, consuming PG-committed execution events for auto-pause and metering.

Scheduled Jobs v3 architecture (diagram):

  • Management path: Client → KrakenD (Zitadel JWT) → SchedulerService → Cloud SQL with three tables (scheduler.schedules, public.river_job, scheduler.executions); a green banner highlights schedule INSERT + River InsertTx in one Postgres transaction.
  • Execution path: River library (LISTEN/NOTIFY + leader-elected periodic_job_enqueuer) → River Worker (HTTPDispatcher, MaxAttempts=10, Idempotency-Key) → Restate ingress (cluster-internal, native dedup) → Target Gateway.
  • Kafka: topic scheduler.executions shown as a reactive bus with auto-pause and metering consumers; a banner reminds the reader Kafka is not on the durability path.
  • A color-coded legend explains trust + ownership boundaries.

The single CRUD seam is SchedulerService. Workers never write to scheduler.schedules (except the counter UPDATE inside the dispatch transaction); CRUD never enqueues River jobs except via river.InsertTx inside the create transaction. The two write paths are kept disjoint deliberately so that "who wrote this row" is always answerable by reading the row.

Component Roles

Component | Responsibility | Owner
SchedulerService | Restate service. CRUD RPCs. Writes scheduler.schedules + river_job rows in one transaction. Owns tenant isolation (RLS), per-project quotas, audit-column population. | Scheduler team
Cloud SQL Postgres | Source of truth for schedules, queue state, execution history. EP + regional HA + Managed Connection Pooling — Phase 1 hard requirements. | Platform team (instance / ops) + Scheduler team (schema / RLS / migrations)
River library | Postgres-backed queue. LISTEN/NOTIFY + 1s poll. PeriodicJob for cron. Lease + rescuer for crash recovery. River UI for ops visibility. | Library — vendored
River Worker | Standalone K8s Deployment, one fleet per target_kind. HTTPDispatcher.Work runs the request, writes the execution row, emits Kafka, returns. | Scheduler team
Target Gateway | Any Restate gateway (NotificationGateway, LLMGateway, …). One new RPC per gateway: HandleScheduled<X>. Idempotency dedup is inherited from Restate ingress — no application-level middleware. | Owning gateway team
Kafka topic scheduler.executions | Reactive event bus. Auto-pause and metering subscribe. Not on the durability path. | Scheduler team

Postgres as the Ceiling

A well-provisioned Cloud SQL EP instance sustains tens of thousands of writes per second — far above any current scheduler load profile. The bottlenecks arrive in this order:

  1. Connection pool exhaustion (managed pooler config).
  2. Vertical Cloud SQL scale. Up to 96 vCPU / 624 GB RAM on EP.
  3. Read replicas for executions queries. Hot write path is river_job + executions inserts; admin/dashboard queries on executions offload to a replica.
  4. Tenant sharding. When a single primary saturates, run multiple instances and shard by tenant_id at SchedulerService insert time. This is the canonical horizontal-scale path.
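
Item 4 only fixes the shard key (tenant_id) and the decision point (SchedulerService insert time), so the following is just a sketch of what that routing could look like; the pool wiring and helper names are hypothetical and not part of the PRD.

// Hypothetical sketch: route each tenant to one of N Cloud SQL primaries.
// Only the shard key (tenant_id) and the decision point (SchedulerService
// insert time) come from the PRD; the rest is illustrative.
type shardedPools struct {
    pools []*pgxpool.Pool // one pool per Cloud SQL primary
}

// poolFor maps a tenant deterministically to a shard, so every write for a
// given tenant lands on the same primary and the same-transaction guarantee
// still holds per tenant.
func (s *shardedPools) poolFor(tenantID string) *pgxpool.Pool {
    h := fnv.New32a() // hash/fnv from the standard library
    h.Write([]byte(tenantID))
    return s.pools[int(h.Sum32())%len(s.pools)]
}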

Multi-region active-active is explicitly out of scope — River maintainers have rejected distributed-SQL backends in riverqueue/river#104. If multi-region active-active ever lands on the roadmap, treat it as a queue-library re-evaluation, not a River extension.

The Same-Transaction Guarantee

This is the load-bearing structural change vs. v2.

Same-transaction guarantee comparison (diagram):

  • v2 (Asynq + Firestore): CreateJob writes Firestore then Asynq via separate API calls; a 💥 marker shows that a handler crash between the two writes leaves the system in a drift state; the v2 mitigation is a load-bearing reconciler running every 5 minutes. v2 footguns listed beneath: asynq.Scheduler TaskID collisions, the self-re-enqueue dance, Scheduler restart losing in-memory state, execution history projected via Kafka → FirebaseBridge → Firestore, and at-most-once delivery if Redis runs without AOF.
  • v3 (River + Cloud SQL): CreateJob runs inside a restate.Run() journal and executes one Postgres SERIALIZABLE transaction with four steps (SET LOCAL RLS, quota check, INSERT scheduler.schedules, river.InsertTx), all committed atomically. Result: drift is impossible by construction, and the reconciler is reduced to a nightly audit-only sweeper that pages on any finding.

Every CRUD handler — CreateJob, UpdateJob, PauseJob, ResumeJob, DeleteJob, DeleteJobsForProject — collapses into one Postgres transaction wrapped in restate.Run():

_, err = restate.Run(ctx, func(rc restate.RunContext) (string, error) {
    return s.db.WithTx(rc.Context(), pgx.TxOptions{IsoLevel: pgx.Serializable}, func(tx pgx.Tx) (string, error) {
        // 1. Set RLS scope for this transaction.
        if _, err := tx.Exec(rc.Context(),
            "SELECT set_config('app.tenant_id', $1, true), set_config('app.project_id', $2, true)",
            tenantID, projectID); err != nil {
            return "", err
        }

        // 2. Quota check inside SERIALIZABLE — two parallel CreateJobs cannot
        //    both pass the count race; PG aborts one and Restate replays.
        if quotaExceeded(tx, tenantID, projectID, s.config) {
            return "", terminalResourceExhausted("schedule quota exceeded")
        }

        // 3. INSERT INTO scheduler.schedules — audit columns + auto_pause_threshold.
        if _, err := tx.Exec(rc.Context(), insertSchedule, args...); err != nil {
            return "", err
        }

        // 4. river.InsertTx in the SAME tx. UniqueOpts.ByArgs makes Restate replays safe.
        if _, err := s.river.InsertTx(rc.Context(), tx, ScheduledDispatchArgs{...},
            &river.InsertOpts{
                ScheduledAt: firstTick,
                MaxAttempts: 10,
                UniqueOpts:  river.UniqueOpts{ByArgs: true, ByPeriod: 5 * time.Minute},
            }); err != nil {
            return "", err
        }

        return scheduleID, nil
    })
})

If the handler crashes mid-transaction, Postgres rolls back atomically, Restate replays the run, and create runs again from scratch. If it commits and the handler crashes before Restate journals the result, the next replay's UniqueOpts.ByArgs makes the second River insert a no-op while the schedule INSERT collides on its primary key — both errors are non-terminal and the end state is identical. No reconciler is needed for create-time drift. This is the structural win over v2.
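
The replay safety described above hinges on the job args being deterministic for a given tick, because UniqueOpts.ByArgs hashes them. Below is a sketch of what ScheduledDispatchArgs could contain, inferred from the dedupe key (schedule_id, scheduled_time), the kind = 'scheduled.dispatch' filter, and the args->>'tenant_id' / args->>'project_id' predicates used by project deletion; the exact field set is the implementer's call.

// Sketch only: the field set is inferred from the dedupe key and the args
// filters used elsewhere on this page; the real struct may differ.
type ScheduledDispatchArgs struct {
    ScheduleID    string    `json:"schedule_id"`
    ScheduledTime time.Time `json:"scheduled_time"` // distinct per tick, so a distinct ByArgs hash
    TenantID      string    `json:"tenant_id"`
    ProjectID     string    `json:"project_id"`
    TargetKind    string    `json:"target_kind"`
}

// Kind must match the WHERE kind = 'scheduled.dispatch' filters used by
// project deletion and the audit sweeper.
func (ScheduledDispatchArgs) Kind() string { return "scheduled.dispatch" }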

The audit-only sweeper survives, but its job description inverted:

 | v2 reconciler | v3 sweeper
Runs | Every 5 min | Nightly
Expected count | Non-zero (cleanup is its job) | Zero
Action on finding | Patch missing tasks, delete orphans | Page on-call — drift means a bug
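
A sketch of one direction of that nightly diff is below, assuming the job args carry schedule_id (the project-deletion query already relies on tenant_id / project_id living in args). The SQL, the metrics helper, and the restriction to once / recurring_interval schedules (cron rows are enqueued per tick, so a quiet gap between ticks is normal) are illustrative choices, not the PRD's query.

// Illustrative nightly sweep, one direction of the diff: active schedules
// with no live river_job. Runs under the dedicated BYPASSRLS role because it
// crosses tenants. It never patches or deletes anything; it only reports.
const auditOrphanedSchedules = `
SELECT s.schedule_id
  FROM scheduler.schedules s
 WHERE s.state = 'active'
   AND s.schedule_type IN ('once', 'recurring_interval')
   AND NOT EXISTS (
         SELECT 1
           FROM public.river_job j
          WHERE j.kind = 'scheduled.dispatch'
            AND j.args->>'schedule_id' = s.schedule_id
            AND j.state IN ('available', 'scheduled', 'retryable', 'running'))`

func runAuditSweep(ctx context.Context, pool *pgxpool.Pool) error {
    rows, err := pool.Query(ctx, auditOrphanedSchedules)
    if err != nil {
        return err
    }
    defer rows.Close()

    var findings []string
    for rows.Next() {
        var id string
        if err := rows.Scan(&id); err != nil {
            return err
        }
        findings = append(findings, id)
    }
    if err := rows.Err(); err != nil {
        return err
    }
    // Expected count is zero; any finding bumps scheduler.audit.findings and
    // the returned error pages on-call.
    if len(findings) > 0 {
        metrics.Count("scheduler.audit.findings", len(findings)) // hypothetical metrics helper
        return fmt.Errorf("audit sweep: %d active schedules with no live river_job: %v", len(findings), findings)
    }
    return nil
}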

Execution Path

Execution sequence (diagram) — 10 numbered steps across 7 swimlanes (River library, Cloud SQL, River Worker, Restate ingress, Target Gateway, Kafka, Auto-pause/Metering consumers):

  1. Cron tick: periodic_job_enqueuer INSERTs the next-tick row into river_job.
  2. LISTEN/NOTIFY wakes a worker (~ms), or the 1s poll catches a future-scheduled job.
  3. Atomic claim via FOR UPDATE SKIP LOCKED + read of the target config under RLS.
  4. POST target.url with Idempotency-Key: sched:<sid>:<ms>.
  5. Restate partition processor records the key atomically and invokes the handler on a new key (cached response on retry — no duplicate side effect).
  6. Target dispatches a Novu push or a message into the thread.
  7. 200 OK back via Restate ingress.
  8. ONE PG transaction: INSERT scheduler.executions + UPDATE scheduler.schedules counters + River ack.
  9. Produce the execution event to Kafka post-commit, best-effort.
  10. Fan out to auto-pause and metering consumers.

Side notes flag recurring_interval enqueueing in a separate tx, retry-vs-cancel semantics, and the deploy-time CI check on idempotency retention.

Why HTTP Dispatch (Not Direct Restate Calls)

Workers POST against arbitrary target URLs read from scheduler.schedules.target.url. The decoupling is deliberate:

  • Decoupled. Scheduler doesn't import every target's proto. Adding a target = adding a URL to config.
  • Same contract every target. HandleScheduled<X> RPCs across NotificationGateway, LLMGateway, etc. share a body shape. Targets can be tested independently with any HTTP client.
  • Internal-only in v1. No KrakenD in the execution path. Worker → Restate ingress is cluster-internal (port 8080). Trust boundary in v1 is Restate ingress itself; per-target auth is a Phase 6 concern for external HTTP targets.

Idempotency: Restate Ingress Does the Work

The dispatcher always sets:

Idempotency-Key: sched:<schedule_id>:<scheduled_time_unix_millis>

This is the Edge Idempotency Key PRD's Client-Generated pattern. Properties:

  • Deterministic from (schedule_id, scheduled_time). Restart-safe — every River retry, rescuer redelivery, or PG-blip retry computes the same key for the same logical execution.
  • Distinct per cron tick. Tomorrow's 7am has a different scheduled_time → different key. No accidental cross-tick deduplication.
  • Restate ingress dedupes natively. Target gateways that are Restate services get dedupe for free — the partition processor atomically records the key, caches the committed response (success and terminal error) for the configured retention window, and on a same-key repeat returns the cached response or attaches to the in-flight invocation.
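
A minimal sketch of how the dispatcher can derive the header is below; the key format is the PRD's, while the helper name and the req variable it is set on are illustrative.

// The key format is fixed: sched:<schedule_id>:<scheduled_time_unix_millis>.
// Helper name is illustrative.
func idempotencyKey(scheduleID string, scheduledTime time.Time) string {
    return fmt.Sprintf("sched:%s:%d", scheduleID, scheduledTime.UnixMilli())
}

// Inside the dispatcher: every retry of the same logical execution recomputes
// the same key, so Restate ingress collapses duplicates within its retention.
req.Header.Set("Idempotency-Key", idempotencyKey(job.Args.ScheduleID, job.Args.ScheduledTime))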

Phase 2 (NotificationGateway) and Phase 3 (LLMGateway) acceptance criteria become: "service is registered with Restate; ingress idempotency retention is configured to ≥ retry window." That's it — no target-side dedup table, no middleware. The scheduler is the Edge Idempotency PRD's first active Client-Generated caller in production.

Retention as Load-Bearing Config

MaxAttempts: 10 and the default exponential backoff cap the total retry window at roughly 6 hours. Restate ingress idempotency retention must be ≥ this window (configured to 24h, ~4× safety margin) so that a retry hours after the original dispatch still hits the dedupe table.

A CI deploy-time check asserts this invariant before any worker rollout:

restate_ingress_idempotency_retention_seconds  >=  river_max_attempts × river_max_backoff_seconds

With v1 settings the assertion is 86400 >= 21600 — passes with 4× margin. The check fails the deploy if either side changes without the other being raised in lockstep. A periodic job emits scheduler.dispatch.idempotency_window_safety so the same invariant is alerted on at runtime, in case Restate retention is silently shifted by central config drift.
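
A sketch of that deploy-time gate as a tiny Go program, assuming the three values are surfaced to CI as environment variables (the variable names are hypothetical; the inequality is the one above).

// Deploy-time guard: fail the pipeline if the retention invariant breaks.
// Env var names are hypothetical; the inequality is the PRD's.
package main

import (
    "fmt"
    "os"
    "strconv"
)

func mustInt(name string) int {
    v, err := strconv.Atoi(os.Getenv(name))
    if err != nil {
        fmt.Fprintf(os.Stderr, "missing or non-integer %s: %v\n", name, err)
        os.Exit(1)
    }
    return v
}

func main() {
    retention := mustInt("RESTATE_INGRESS_IDEMPOTENCY_RETENTION_SECONDS") // 86400 in v1
    attempts := mustInt("RIVER_MAX_ATTEMPTS")                             // 10 in v1
    maxBackoff := mustInt("RIVER_MAX_BACKOFF_SECONDS")

    retryWindow := attempts * maxBackoff // ~21600s with v1 settings
    if retention < retryWindow {
        fmt.Fprintf(os.Stderr, "idempotency retention %ds < River retry window %ds; raise retention or shrink the retry window\n", retention, retryWindow)
        os.Exit(1)
    }
    fmt.Printf("ok: retention %ds >= retry window %ds (%.1fx margin)\n", retention, retryWindow, float64(retention)/float64(retryWindow))
}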

The doc-only safeguard the PRD originally proposed was deliberately rejected — silent misconfiguration of retention turns every PG-blip retry into a duplicate medication reminder, which is exactly the failure mode Idempotency-Key exists to prevent.

Silent-Drop Risks the Dispatcher Closes

Two places where a tick can fire but leave no record. The dispatcher closes both.

  1. PG transaction commit failure → execution row + counters lost, but HTTP side effect already happened. The dispatcher commits after the HTTP call returns. If the COMMIT fails (PG temporarily unavailable, conflict), River retries — which re-runs the HTTP dispatch. Safe only because Restate ingress dedupes on Idempotency-Key, so the duplicate dispatch is a no-op at the target. This dependency is load-bearing — losing it would mean every PG write blip becomes a duplicate side effect.
  2. Kafka emit failure when PG row already committed. Kafka emit happens after PG commits. If Kafka is unavailable, returning err makes River retry the whole Work() — re-dispatching HTTP and trying to insert another execution row. The duplicate INSERT must be caught by a dedupe mechanism on (schedule_id, scheduled_time, attempt); the PRD captures three viable shapes (per-partition unique index, drop partitioning, or app-layer dedupe) — the implementer picks one in Phase 1.
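
Putting the ordering together, a skeleton of the dispatcher's Work() under those two rules is sketched below. The helpers (loadTarget, dispatchHTTP, insertExecution, bumpCounters, executionEvent) and the producer interface are hypothetical, and the River ack is left on River's default post-Work path rather than folded into the transaction, so treat this as an illustration of the ordering rather than the PRD's exact dispatcher.

// Sketch of the Work() ordering only; helper names are illustrative.
type HTTPDispatcher struct {
    river.WorkerDefaults[ScheduledDispatchArgs]
    pool     *pgxpool.Pool
    producer executionEventProducer // hypothetical Kafka producer interface
}

func (w *HTTPDispatcher) Work(ctx context.Context, job *river.Job[ScheduledDispatchArgs]) error {
    // Read the target config under RLS. No row means the owner lost project
    // access, so cancel (terminal, retained for forensics) instead of retrying.
    target, err := w.loadTarget(ctx, job.Args) // hypothetical RLS-scoped read
    if errors.Is(err, pgx.ErrNoRows) {
        return river.JobCancel(err)
    }
    if err != nil {
        return err
    }

    // 1. HTTP dispatch first, carrying the deterministic Idempotency-Key.
    //    Any later failure re-runs this call; Restate ingress dedupes it.
    result, err := w.dispatchHTTP(ctx, target, job.Args)
    if err != nil {
        return err
    }

    // 2. One Postgres transaction: execution row + schedule counters.
    tx, err := w.pool.Begin(ctx)
    if err != nil {
        return err
    }
    defer tx.Rollback(ctx)
    if err := w.insertExecution(ctx, tx, job, result); err != nil {
        // A duplicate (schedule_id, scheduled_time, attempt) from an earlier
        // partially-failed pass must be absorbed here; the dedupe shape is
        // the Phase 1 implementer's choice.
        return err
    }
    if err := w.bumpCounters(ctx, tx, job.Args.ScheduleID, result); err != nil {
        return err
    }
    if err := tx.Commit(ctx); err != nil {
        // Risk #1: the HTTP side effect already happened. Retrying is safe
        // only because the same Idempotency-Key makes the re-dispatch a no-op.
        return err
    }

    // 3. Kafka emit after commit. Risk #2: returning the error retries the
    //    whole Work(); ingress dedupes the HTTP side and the execution-row
    //    dedupe absorbs the duplicate INSERT.
    return w.producer.Emit(ctx, executionEvent(job, result))
}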

Note: River has no equivalent of Asynq's RevokeTask (silent drop with no archive). river.JobCancel is the analog of SkipRetry (terminal, retained for forensics). One footgun fewer.

One Event Per Attempt vs. One Per Logical Execution

The dispatcher inserts an execution row on every attempt (with attempt = job.Attempt) and emits a Kafka event for every committed row. A 5xx-then-success produces two execution rows (one with status=failed, attempt=1; one with status=success, attempt=2) and two Kafka events. Consumers must filter:

Consumer | Filter
Auto-pause | Look at the most recent N rows for a schedule_id ORDER BY completed_at DESC. Don't react to a single mid-retry failure — the latest row's status determines counter reset.
Metering | Count only status='success' rows, or count (schedule_id, scheduled_time) groups (one per logical execution).

Simpler than v2's "split into two topics" approach because the source of truth is the executions table — consumers can SQL-query it whenever they need ground truth, not just react to the Kafka stream.

Postgres Schema Highlights

Full DDL lives in the PRD. The shape worth knowing:

CREATE TABLE scheduler.schedules (
    schedule_id          TEXT PRIMARY KEY,
    tenant_id            TEXT NOT NULL,
    project_id           TEXT NOT NULL,
    owner_subject        TEXT NOT NULL,   -- 'user:alice'
    schedule_type        TEXT NOT NULL CHECK (schedule_type IN ('cron','once','recurring_interval')),
    cron_expression      TEXT,
    timezone             TEXT DEFAULT 'UTC',
    target               JSONB NOT NULL,  -- {url, method, kind, payload}
    state                TEXT NOT NULL DEFAULT 'active' CHECK (state IN ('active','paused','deleted')),
    auto_pause_threshold INTEGER NOT NULL DEFAULT 10
        CHECK (auto_pause_threshold = 0 OR auto_pause_threshold BETWEEN 3 AND 100),
    created_by_subject   TEXT NOT NULL,
    paused_by_subject    TEXT, paused_reason TEXT, paused_at TIMESTAMPTZ,
    -- (resumed_/deleted_/updated_ pairs as well)
    next_trigger_at      TIMESTAMPTZ,
    trigger_count        BIGINT NOT NULL DEFAULT 0,
    failure_count        BIGINT NOT NULL DEFAULT 0
);

CREATE TABLE scheduler.executions (
    execution_id   BIGSERIAL,
    schedule_id    TEXT NOT NULL REFERENCES scheduler.schedules(schedule_id) ON DELETE CASCADE,
    tenant_id      TEXT NOT NULL,   -- denormalized for query perf
    project_id     TEXT NOT NULL,
    scheduled_time TIMESTAMPTZ NOT NULL,
    started_at     TIMESTAMPTZ NOT NULL,
    completed_at   TIMESTAMPTZ NOT NULL,
    status         TEXT NOT NULL CHECK (status IN ('success','failed','malformed')),
    http_status    INTEGER,
    attempt        INTEGER NOT NULL,
    duration_ms    INTEGER NOT NULL,
    target_kind    TEXT NOT NULL,
    error          TEXT,
    river_job_id   BIGINT,
    -- Postgres requires a partitioned table's primary key to include the partition key.
    PRIMARY KEY (execution_id, completed_at)
) PARTITION BY RANGE (completed_at);

Notes:

  • RLS enforced on both tables. SchedulerService sets app.tenant_id and app.project_id per-transaction via set_config(); cross-tenant ops use a dedicated BYPASSRLS role, never the application role. RLS is defense-in-depth on top of OpenFGA, not a replacement.
  • scheduler.executions is range-partitioned monthly via pg_partman with 90-day default retention (configurable per-tenant later). Partition drop is metadata-only — constant-cost cleanup regardless of execution volume.
  • Audit columns are denormalized — only the latest paused_by_subject is kept, not full pause/resume history. Engineering forensics rely on Loki structured logs; tenant-facing audit history will arrive when Retraced lands platform-wide (Phase 5 — ~1-2 days of Retraced.Publish() plumbing, no schema change).

Schedule Types Map to River

Type | Fields used | River mapping
cron | cron_expression, timezone | Registered as a river.PeriodicJob at worker startup. Leader-elected periodic_job_enqueuer inserts a fresh river_job row per tick.
once | scheduled_at | Single river.Client.InsertTx(args, InsertOpts{ScheduledAt: scheduled_at}) at create time.
recurring_interval | interval_seconds | Worker computes the next tick at the end of dispatch and inserts a new River job in a separate transaction after the execution row commits. On worker crash mid-window, River retries the whole Work(); Restate's Idempotency-Key dedupes the duplicate dispatch and the next-tick insert is re-attempted.
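
A sketch of the cron row's mapping follows, written as the loadPeriodicJobsFromDB function the next subsection refers to. It assumes robfig/cron/v3 purely as the expression parser (its Schedule type exposes Next(time.Time) time.Time, which is the shape River's periodic scheduler expects); the query, error policy, timezone handling, and per-tick dedup options are elided or illustrative.

// Sketch: read active cron schedules and register one river.PeriodicJob per
// row at worker startup. Timezone handling and per-tick dedup options omitted.
func loadPeriodicJobsFromDB(ctx context.Context, pool *pgxpool.Pool) ([]*river.PeriodicJob, error) {
    rows, err := pool.Query(ctx, `
        SELECT schedule_id, tenant_id, project_id, cron_expression, target->>'kind'
          FROM scheduler.schedules
         WHERE state = 'active' AND schedule_type = 'cron'`)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var periodic []*river.PeriodicJob
    for rows.Next() {
        var scheduleID, tenantID, projectID, expr, targetKind string
        if err := rows.Scan(&scheduleID, &tenantID, &projectID, &expr, &targetKind); err != nil {
            return nil, err
        }
        sched, err := cron.ParseStandard(expr) // robfig/cron/v3
        if err != nil {
            return nil, fmt.Errorf("schedule %s: %w", scheduleID, err)
        }
        periodic = append(periodic, river.NewPeriodicJob(
            sched, // cron.Schedule satisfies River's periodic-schedule interface
            func() (river.JobArgs, *river.InsertOpts) {
                // Invoked by the leader-elected enqueuer at each tick.
                return ScheduledDispatchArgs{
                    ScheduleID: scheduleID,
                    TenantID:   tenantID,
                    ProjectID:  projectID,
                    TargetKind: targetKind,
                }, nil
            },
            nil,
        ))
    }
    return periodic, rows.Err()
}

The resulting slice would be handed to the River client at construction (River's Config.PeriodicJobs), which is also why cron edits only take effect at the next worker deploy, as the next subsection spells out.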

Cron Edits Land at Next Worker Deploy

UpdateJob to cron_expression takes effect when loadPeriodicJobsFromDB runs — i.e., next worker Deployment restart. No hot-reload in v1. Hot-reload (LISTEN/NOTIFY watcher + PeriodicJob re-registration) is ~3-5 days of work for a non-load-bearing UX feature; deferred until cron edit volume becomes a recurring support question.

The documented immediate-effect workaround — both in API docs and in the UpdateJob response itself — is PauseJob on the existing schedule + CreateJob for a new one with the desired cron. The UpdateJob response surfaces effective_at: "next_worker_deployment" so consuming UIs can communicate this clearly.

Multi-Tenancy & Quotas

Quota | Default | Enforcement
max_schedules_per_project | 500 | CreateJob returns RESOURCE_EXHAUSTED. Counted via an RLS-scoped query inside the SERIALIZABLE transaction to prevent race conditions.
max_schedules_per_user | 50 | Limits an individual user's schedules within a project.
min_schedule_interval | 60s | Reject cron expressions more frequent than once per minute — River's poll cadence is 1s, but sub-minute precision is not a stable guarantee.
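
For the first row, here is a sketch of the count that runs inside the SERIALIZABLE create transaction; it is a simplified variant of the quotaExceeded helper used in the create-transaction snippet earlier (signature and predicate illustrative). Because RLS scope was set via set_config earlier in the same transaction, the count is naturally limited to the caller's tenant and project.

// Illustrative quota check. The count only sees rows visible under the RLS
// scope already set in this transaction; the predicate is a sketch.
func quotaExceeded(ctx context.Context, tx pgx.Tx, maxPerProject int) (bool, error) {
    var n int
    if err := tx.QueryRow(ctx,
        `SELECT count(*) FROM scheduler.schedules WHERE state != 'deleted'`).Scan(&n); err != nil {
        return false, err
    }
    // Under SERIALIZABLE, two concurrent CreateJobs that both read n = 499
    // cannot both commit; Postgres aborts one and Restate replays it.
    return n >= maxPerProject, nil
}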

Project deletion runs in a single Postgres transaction:

-- 1. Cancel pending River jobs for this project
UPDATE public.river_job SET state = 'cancelled', finalized_at = now()
WHERE kind = 'scheduled.dispatch' AND state IN ('available','scheduled','retryable')
AND (args->>'tenant_id') = $1 AND (args->>'project_id') = $2;

-- 2. Soft-delete schedule rows
UPDATE scheduler.schedules SET state = 'deleted', updated_at = now()
WHERE tenant_id = $1 AND project_id = $2 AND state != 'deleted';

Both steps share a transaction, so partial deletion is impossible.

Auto-Pause is Per-Schedule, Configurable

Different schedules in the same project legitimately want different thresholds (medication reminders vs. marketing digests). Stored on scheduler.schedules.auto_pause_threshold:

  • 0 — disabled (manual pause only)
  • 3..100 — valid range
  • 10 — default

Per-tenant defaults are not built in v1. Add only if a tenant explicitly asks. The auto-pause Kafka consumer reads the threshold per-schedule and filters on (attempt, status) to avoid reacting to mid-retry failures — only the final failure of a fully-retried dispatch counts toward the consecutive-failure counter.
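
Below is a sketch of the consumer's decision step on each execution event. Ground truth is the executions table (per the consumer-filter table earlier), so the consumer re-queries recent rows rather than keeping its own counter; the SQL, the window size, the attempt >= maxAttempts reading of the "(attempt, status)" filter, and the schedulerClient interface are all illustrative.

// Illustrative auto-pause decision. Runs cross-tenant (BYPASSRLS role).
func maybeAutoPause(ctx context.Context, pool *pgxpool.Pool, sched schedulerClient,
    scheduleID string, threshold, maxAttempts int) error {
    if threshold == 0 {
        return nil // auto-pause disabled for this schedule
    }
    rows, err := pool.Query(ctx, `
        SELECT status, attempt
          FROM scheduler.executions
         WHERE schedule_id = $1
         ORDER BY completed_at DESC
         LIMIT $2`, scheduleID, threshold*maxAttempts) // window sizing illustrative
    if err != nil {
        return err
    }
    defer rows.Close()

    consecutive := 0
    for rows.Next() {
        var status string
        var attempt int
        if err := rows.Scan(&status, &attempt); err != nil {
            return err
        }
        if status == "success" {
            break // the failure streak ends at the latest success
        }
        if status == "failed" && attempt >= maxAttempts {
            consecutive++ // final failure of a fully-retried dispatch
        }
        // Mid-retry failures neither count nor reset the streak.
    }
    if err := rows.Err(); err != nil {
        return err
    }
    if consecutive >= threshold {
        return sched.PauseJob(ctx, scheduleID, "auto:consecutive_failures")
    }
    return nil
}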

Security — Restate Ingress Is the Trust Boundary

In v1, all dispatch targets are Restate services reached via Restate ingress. The trust boundary is Restate ingress itself, not application-level signatures:

  • Worker → Restate ingress is internal cluster traffic (port 8080, no KrakenD in this path).
  • Restate ingress validates the request shape and routes to the registered target service.
  • Restate ingress's native Idempotency-Key dedup prevents replay-based attacks within the retention window.

No HMAC signing in v1. The PRD's earlier draft called for X-Schedule-Signature; it was dropped because no v1 target verifies it (all v1 targets rely on ingress trust) and the operational cost of OpenBao key rotation was non-trivial for a 2-person team. HMAC re-enters as one option in target.auth.kind for external HTTP targets if Phase 6 is ever picked up — alongside bearer, oauth2_client_credentials, and mtls.

On-Behalf-Of Authorization

Scheduled actions use permission intersection: service:scheduler scope AND the owner_subject's current permissions. If the original user loses project access, their schedules fail at execution time — the worker's RLS-scoped target read returns no row and the dispatch is JobCancel'd.

Component Inventory

New / Extended Components

Component | Purpose | PRD section
services/scheduler/v1/ | New Restate service. CRUD RPCs (CreateJob, GetJob, ListJobs, UpdateJob, PauseJob, ResumeJob, DeleteJob, DeleteJobsForProject, ListExecutions). | §SchedulerService API
Cloud SQL EP + regional HA + Managed Connection Pooling | Source of truth for everything. Provisioned by platform team. Phase 1 hard requirement. | §Phase 1
scheduler.schedules migration | Schedule definitions + denormalized audit columns + auto_pause_threshold. | §Postgres Schema
scheduler.executions migration | Range-partitioned via pg_partman. | §Postgres Schema
RLS policies | tenant_id + project_id isolation; per-tx via SET LOCAL. | §Multi-Tenancy via Row-Level Security
River library + River's own migrations | Queue, cron, lease, rescuer. Migrations applied at deploy time. | §River Worker — HTTP Dispatcher
River Worker (HTTPDispatcher) | Standalone K8s Deployment, one fleet per target_kind. | §River Worker
CI deploy-time idempotency-retention check | Asserts Restate retention >= MaxAttempts × max-backoff before any worker rollout. | §Idempotency Retention as Load-Bearing Config
Audit-only reconciliation cron | Nightly sweeper, expected-zero. | §Reconciliation — Audit-Only
KrakenD /scheduler/* routes | Scheduler team adds directly (2-person team — no formal cross-team handoff). | §Phase 1
Auto-pause Kafka consumer | Reads auto_pause_threshold per schedule; calls PauseJob(reason="auto:consecutive_failures"). | §Phase 4, §Auto-Pause
Metering Kafka consumer | Per-execution Lago events; dedups on (schedule_id, scheduled_time). | §Phase 4

Integrated Components (No Changes Required)

Component | Role
Restate Go SDK | restate.Run() for journaling the create transaction; restate.WithIdempotencyKey is unused — the dispatcher sets the canonical IETF header directly.
Restate ingress | Native Idempotency-Key dedup on the partition processor. Handles cache hits on retries automatically.
pkg/tenancy | TenantFromContext / ProjectFromContext for setting RLS scope.
OpenFGA | Authorization for CRUD: member for read, admin for write/delete. RLS is defense-in-depth on top, not a replacement.
Loki | Structured-log substrate for engineering forensics; pre-Retraced audit evidence trail.

Observability Additions

Metrics emitted from SchedulerService and the Worker. Labels include tenant_id + project_id so per-project dashboards work day one.

Metric | Type | Description
scheduler.schedules.total | Gauge | Schedule count by state, target_kind, project
scheduler.dispatch.total | Counter | Dispatch invocations by target_kind, result
scheduler.dispatch.duration_ms | Histogram | Dispatch latency
scheduler.dispatch.retries | Counter | River retries by attempt number
scheduler.dispatch.idempotency_window_safety | Gauge | Runtime check against the retention invariant
scheduler.queue.available | Gauge | River pending jobs by queue (drives KEDA HPA)
scheduler.queue.scheduled | Gauge | River future-scheduled jobs by queue
scheduler.audit.findings | Counter | Reconciliation sweeper findings — expected zero
pg.connection_pool.in_use | Gauge | Cloud SQL pooler — connections in use
pg.replication.lag_seconds | Gauge | HA replica lag (alert if > 10s)

Alerts at runtime: pending queue depth > 1000 for 5m (warn), dispatch error rate > 10% for 5m (page), reconciliation finding any (page), Cloud SQL HA replica lag > 10s for 1m (warn), connection pool exhaustion any (page), River rescuer requeues > 50/hr (investigate).

Rollout Phases

The 5-phase plan from the PRD, plus a deliberately-not-on-roadmap Phase 6 placeholder. Phase 1 timeline depends on platform-team capacity to provision Cloud SQL — confirm provisioning lead time before committing to the 2-week estimate.

Scheduled Jobs v3 phase dependencies (diagram): five active phases plus a Phase 6 placeholder. External dependencies sit at the top: the platform team provisions Cloud SQL EP+HA+Managed Pooling, the Projects PRD provides the tenant_id+project_id RLS scope, Edge Idempotency provides Restate ingress dedup, the Retraced rollout triggers Phase 5 audit work, and an external customer ask is explicitly NOT on the roadmap. Phase 1 (SchedulerService + Cloud SQL + River, 2 weeks) is the substrate. Phase 2 (NotificationGateway target, 1w) and Phase 3 (LLMGateway target, 1w) chain after it; both share the same idempotency story. Phase 4 (execution history + full CRUD with auto-pause + metering, 1w) follows. Phase 5 (maintenance + Retraced when the platform lands) depends only on Phase 1 and runs in parallel with 2/3/4. Phase 6 (external HTTP targets) is a red dashed placeholder with no dependency arrow — Phase 1 deliberately invests no scaffolding for it. Footer notes explain the parallelism and the explicit non-commitment to Phase 6.

Phase | Scope | Estimate | Status
1. SchedulerService + Cloud SQL + River | Cloud SQL EP + HA + Managed Pooling provisioned by platform team (smallest class, scale reactively); migrations for scheduler.schedules (with audit columns + auto_pause_threshold), scheduler.executions (pg_partman), RLS policies, River's own migrations; CRUD subset (CreateJob, GetJob, ListJobs, DeleteJob) with one-PG-tx + SERIALIZABLE; River Worker Deployment per target_kind, MaxAttempts=10, ~6h retry window, canonical Idempotency-Key header; CI deploy-time check (retention ≥ retry window); KrakenD /scheduler/* routes; River UI; audit-only reconciliation cron. No HMAC in Phase 1. | 2 weeks | Not started
2. NotificationGateway target | HandleScheduledNotification RPC; configure Restate ingress retention ≥ retry window; Novu dispatch; end-to-end test with duplicate-Idempotency-Key dedup verification. No application-level dedup middleware required. | 1 week | Not started
3. LLMGateway target | HandleScheduledMessage RPC; same idempotency story as Phase 2; message injection into conversation thread; end-to-end test. | 1 week | Not started
4. Execution history + full CRUD | Auto-pause Kafka consumer (per-schedule auto_pause_threshold, default 10, range 0/3-100); metering Kafka consumer (Lago); ListExecutions (RLS-scoped queries); UpdateJob (with effective_at: "next_worker_deployment" response field), PauseJob, ResumeJob, DeleteJobsForProject. | 1 week | Not started
5. Maintenance + Retraced | OTel + Grafana dashboard (Cloud SQL + River + dispatch, including the safety-margin metric); worker drain procedure; PG minor-version upgrade runbook (sub-second blip on EP+HA); PG major-version upgrade via DMS-managed cutover (~30s blackout, ~5y cadence); load test against realistic Cloud SQL sizing; Retraced integration (~1-2 days Retraced.Publish() plumbing in CRUD handlers) when the platform-wide rollout lands. | | Not started — Retraced timing depends on platform

Phase Dependencies

Phase depends on | Reason
Phase 1 depends on platform team's Cloud SQL provisioning | Phase 1 cannot start without the instance. Confirm lead time before committing to the 2-week estimate.
Phase 1 depends on Projects PRD | RLS scope tenant_id + project_id is established by Projects.
Phase 1 depends on Edge Idempotency PRD | Restate ingress dedup is the v1 trust+correctness boundary for retries.
Phases 2 and 3 can run in parallel | Both consume Phase 1's substrate; share the same idempotency story; ordering between them is a team-capacity decision, not a dependency.
Phase 4 depends on Phases 2/3 | Auto-pause + metering need real targets to observe; ListExecutions is most useful when there are executions to list.
Phase 5 depends only on Phase 1 | Observability + runbooks + load test are about the substrate, not the tenants. Retraced lands separately based on platform availability.
Phase 6 has no dependency arrow into 1-5 | Phase 1 deliberately invests no scaffolding for it (see Phase 6 placeholder).

Future: Phase 6 (External HTTP Targets) — Placeholder Only

Not committed. No customer ask, no dated roadmap entry. The section exists in the PRD so the design space is visible — i.e., if/when external webhook scheduling is requested, the extension shape is clear and the v1 architecture doesn't need to change.

If picked up, the additions are well-scoped:

  1. Per-target auth on the schedule doc — target.auth.kind in {none, hmac, bearer, oauth2_client_credentials, mtls} with secret_ref to OpenBao.
  2. target.headers map for per-schedule custom headers; forbidden headers (Idempotency-Key, X-Schedule-*, Host) stripped or rejected at CreateJob.
  3. Per-tenant egress allowlist. target.url host must match a configured allowlist — without this, CreateJob is an SSRF / data-exfiltration primitive. SOC2 hard requirement.
  4. Idempotency becomes the target's problem. For non-Restate targets, Idempotency-Key is advisory — the worker still sends it, but the target may or may not honor it. At-least-once duplicates from worker crashes/retries are the contract.
  5. Per-target rate limiting. Trivial extension of queue partitioning — one queue per target host.

When/if this phase is picked up, expect roughly +50 lines of PRD, additive schema fields on target, and an admin-time allowlist policy. None of the cron / delivery / scaling decisions in this PRD change.

Out of Scope for v1

  • Multi-region active-active scheduling. Single-region single-cluster is an explicit accepted constraint (River is Postgres-only by design — see riverqueue/river#104). Tenant sharding is the canonical horizontal-scale path beyond single-instance vertical scale.
  • Hot-reload of cron schedule edits. UpdateJob to cron_expression takes effect at next worker Deployment restart. Documented Pause+Create-new workaround for immediate-effect needs.
  • HMAC signing/verification on dispatched HTTP requests. Restate ingress is the v1 trust boundary; HMAC re-enters as one option in target.auth.kind if Phase 6 is picked up.
  • Tenant-facing audit log UI / append-only audit-events table. Phase 1 is denormalized columns + Loki only. Retraced lands in Phase 5; full event-log auditing arrives with it. Pre-Retraced SOC2 gap is an accepted risk; if Retraced timeline slips materially, revisit.
  • Per-tenant default auto_pause_threshold. Per-schedule only in v1.
  • Per-tenant Cloud SQL instances. One scheduler-scoped instance for v1; tenant sharding only when a single primary saturates.
  • Broader Firestore → Postgres migration. Cloud SQL is provisioned because River requires it. Other domains stay on Firestore unless they make a separate per-domain decision.
  • External HTTP destinations. Phase 6 placeholder — see above.
  • _platform-scoped schedules. No org-wide schedules in v1. Lands with the first _platform consumer's PRD.

Cross-References