Scheduled Jobs Roadmap
End-to-end view of how the platform builds a scheduler primitive — proactive LLM check-ins, medication reminders, digest emails, metering rollups, or any cron-driven work — on top of River and Cloud SQL Postgres. This page consolidates the Scheduled Jobs PRD (v3 — River + Cloud SQL, supersedes v2 Asynq+Firestore and v0.4 GCP Cloud Scheduler) and its dependencies into a single reference. The design is complete; the 5-phase rollout has not started.
This page is derived from the v3 PRD and its closely related dependencies:
Primary (docs/prd/):
- Scheduled Jobs — the v3 design (2026-04-30): River + Cloud SQL, same-transaction CreateJob, audit-only reconciliation, Restate ingress as Phase 1 trust boundary, 5-phase rollout (+ a deliberately-not-on-roadmap Phase 6 placeholder)
Dependencies:
- Edge Idempotency Key — canonical `Idempotency-Key` semantics. The dispatcher is the first active Client-Generated caller of that pattern: every HTTP dispatch carries `Idempotency-Key: sched:<schedule_id>:<scheduled_time_unix_millis>` and Restate ingress dedupes natively
- Authorization PRD — typed-subject format used by `owner_subject` and the audit columns
- Projects PRD — `tenant_id` + `project_id` columns and Postgres RLS policies isolate schedules per project
- Production Infrastructure PRD — Cloud SQL Enterprise Plus + regional HA + GCP Managed Connection Pooling are Phase 1 hard requirements
Related:
- Tenant & Project Lifecycle Roadmap — multi-tenant substrate (`tenants/{tid}/projects/{pid}/...`) the RLS scope is built on
- GCP Eventarc → Kafka Pipeline Roadmap — sibling pattern for async GCP-originated events; explains why scheduler dispatches deliberately do not flow through Kafka
- Idempotency Roadmap — sibling roadmap; this design's dispatcher consumes the Edge Idempotency primitive end-to-end
The whole design refuses to operate two storage systems for one logical thing. Schedules and the queue rows that drive them go into the same Postgres transaction — drift between schedule state and queue state becomes impossible by construction, and the periodic reconciler that v2 needed to band-aid Firestore↔Redis drift collapses into an audit-only sweeper that pages on any non-zero finding.
- Cloud SQL is the source of truth for everything: schedule docs (`scheduler.schedules`), queue rows (`river_job` — owned by the River library), and execution history (`scheduler.executions`, range-partitioned monthly via `pg_partman`). One store, one durability story.
- River is the execution engine — Postgres-backed Go job queue with native cron via `PeriodicJob`, leader-elected enqueueing via Postgres advisory lock, native lease + rescuer recovery, `UniqueOpts` for per-tick dedup. Replaces v2's self-re-enqueue dance (workarounds for `asynq.Scheduler` bugs) with library-native primitives.
- HTTP dispatch via Restate ingress with the canonical IETF `Idempotency-Key` header — Restate's partition processor dedupes natively, so target gateways inherit at-most-once side effects without writing application-level middleware.
- Kafka is a reactive bus, not a projection pipeline. Execution rows live in Postgres; the `scheduler.executions` Kafka topic exists only for downstream consumers (auto-pause, metering). If Kafka is unavailable, execution data is durable in PG and missed events can be replayed. Kafka is not on the durability path.
Canonical reference: Scheduled Jobs PRD v3. The v0.4 (GCP Cloud Scheduler) and v2 (Asynq + Firestore) designs are superseded and preserved only for git-history forensics.
Glossary
| Term | Definition |
|---|---|
| SchedulerService | Restate service. CRUD over schedules. Writes scheduler.schedules and river_job rows in one Postgres transaction wrapped in restate.Run() for journaling. |
| River | Postgres-backed Go job queue (riverqueue.com). Provides LISTEN/NOTIFY + 1s poll dispatch, lease + rescuer recovery, native PeriodicJob for cron, UniqueOpts for per-tick dedup, and an Asynqmon-equivalent admin UI. |
| River Worker | Standalone Go K8s Deployment. One fleet per target_kind (notification, llm, metering, …) for queue isolation. Runs river.Client.Start() to claim jobs and dispatch HTTP. |
| Same-transaction guarantee | Schedule INSERT + river.InsertTx() share a single Postgres transaction at SERIALIZABLE isolation. Either both rows commit or neither does. Drift between schedule state and queue state is structurally impossible — the v2 reconciler is gone. |
| Audit-only reconciliation | Nightly sweeper that diffs scheduler.schedules against river_job looking for active schedules with no future job (or vice versa). Expected count: zero. Any finding pages on-call — drift means a bug we want to know about, not routine cleanup. |
| PeriodicJob | River's native cron primitive. The leader-elected periodic_job_enqueuer (singleton across the fleet via Postgres advisory lock) inserts a fresh river_job row at each cron tick. Per-tick identity is automatic — no *Task/TaskID collisions like asynq. |
| UniqueOpts.ByArgs | River feature that dedupes job inserts by hash of args. Used to make restate.Run() replays safe — a second InsertTx for the same (schedule_id, scheduled_time) is a no-op. |
| Idempotency-Key | The IETF canonical header. The dispatcher always sets Idempotency-Key: sched:<schedule_id>:<scheduled_time_unix_millis> — deterministic, restart-safe, distinct per cron tick. Restate ingress dedupes natively; target gateways inherit safe retries. |
| auto_pause_threshold | Per-schedule integer; range 0 (disabled) or 3..100 (default 10). After N consecutive failed dispatches, the auto-pause Kafka consumer calls PauseJob(reason="auto:consecutive_failures"). The reason lands in paused_reason so tenants distinguish auto-pause from manual pause. |
| Retention as load-bearing config | Restate ingress idempotency retention (24h) MUST be ≥ River MaxAttempts × max-backoff (~6h). A CI deploy-time check fails the deploy if either side changes. The runtime metric scheduler.dispatch.idempotency_window_safety alerts if retention silently shrinks. |
| _platform sentinel | Reserved value of tenant_id / project_id for org-wide schedules (none in v1). Out of scope until first _platform-scoped schedule's PRD lands. |
Why v3 — What Got Rewritten
Two previous designs made it to full PRDs; both were deliberately superseded.
| Version | Substrate | Why retired |
|---|---|---|
| v0.4 | GCP Cloud Scheduler | Hard 500-jobs-per-project ceiling. ~$1,000/mo at 10k schedules. Per-tenant project sharding would have meant building the abstraction we needed inside our application anyway. |
| v2 | Asynq + Firestore + Kafka projection | Two stores that could drift → load-bearing reconciler every 5 min. asynq.Scheduler reused the same *Task/TaskID across ticks (collisions from tick #2 onward). Self-re-enqueue dance worked around the bug but was a canary for "fighting the library." Scheduler restart lost in-memory cron registrations. Redis without AOF = at-most-once. |
| v3 (current) | River + Cloud SQL | Native PeriodicJob (no self-re-enqueue). Same Postgres transaction for schedule + queue (no reconciler). PG durability inherited from Cloud SQL EP + HA. LISTEN/NOTIFY wakes workers in ~ms. Tens of thousands of writes/sec headroom on a single instance. |
The cost shape changed too: ~$300/mo at 10k schedules for Cloud SQL EP + HA + Managed Pooling, CTO-approved as a line item. Phase 1 starts on the smallest EP class and scales reactively based on monitoring — no headroom pre-paid for hypothetical future domains.
Cloud SQL Is Pragmatic, Not a Migration Commitment
This Cloud SQL instance is provisioned because River requires Postgres. The decision is scheduler-scoped — the PRD does not commit to migrating other domains away from Firestore. If/when a second domain wants Postgres, instance topology (shared vs. per-domain) is a fresh decision at that time. The schedule schema is namespaced under scheduler.* so the instance can be repurposed cleanly if needed, but no part of the design depends on that.
Ownership: platform team owns the instance end-to-end (provisioning, HA topology, Managed Pooling config, backup verification, restore tests, on-call, capacity). Scheduler team owns schema definitions, query patterns, RLS policies, and application-level tuning.
Architecture
The system has two paths sharing one Postgres instance: a synchronous CRUD path (Client → KrakenD → SchedulerService → PG) and an asynchronous execution path (River → Worker → Restate → Target). Kafka sits to the side, consuming PG-committed execution events for auto-pause and metering.
The single CRUD seam is SchedulerService. Workers never write to scheduler.schedules (except the counter UPDATE inside the dispatch transaction); CRUD never enqueues River jobs except via river.InsertTx inside the create transaction. The two write paths are kept disjoint deliberately so that "who wrote this row" is always answerable by reading the row.
Component Roles
| Component | Responsibility | Owner |
|---|---|---|
| SchedulerService | Restate service. CRUD RPCs. Writes scheduler.schedules + river_job rows in one transaction. Owns tenant isolation (RLS), per-project quotas, audit-column population. | Scheduler team |
| Cloud SQL Postgres | Source of truth for schedules, queue state, execution history. EP + regional HA + Managed Connection Pooling — Phase 1 hard requirements. | Platform team (instance / ops) + Scheduler team (schema / RLS / migrations) |
| River library | Postgres-backed queue. LISTEN/NOTIFY + 1s poll. PeriodicJob for cron. Lease + rescuer for crash recovery. River UI for ops visibility. | Library — vendored |
| River Worker | Standalone K8s Deployment, one fleet per target_kind. HTTPDispatcher.Work runs the request, writes the execution row, emits Kafka, returns. | Scheduler team |
| Target Gateway | Any Restate gateway (NotificationGateway, LLMGateway, …). One new RPC per gateway: HandleScheduled<X>. Idempotency dedup is inherited from Restate ingress — no application-level middleware. | Owning gateway team |
| Kafka topic scheduler.executions | Reactive event bus. Auto-pause and metering subscribe. Not on the durability path. | Scheduler team |
Postgres as the Ceiling
Tens of thousands of writes/sec on a well-provisioned Cloud SQL EP instance — far above any current scheduler load profile. The bottlenecks come in this order:
- Connection pool exhaustion (managed pooler config).
- Vertical Cloud SQL scale. Up to 96 vCPU / 624 GB RAM on EP.
- Read replicas for `executions` queries. Hot write path is `river_job` + `executions` inserts; admin/dashboard queries on `executions` offload to a replica.
- Tenant sharding. When a single primary saturates, run multiple instances and shard by `tenant_id` at SchedulerService insert time. This is the canonical horizontal-scale path.
Multi-region active-active is explicitly out of scope — River maintainers have rejected distributed-SQL backends in riverqueue/river#104. If multi-region active-active ever lands on the roadmap, treat it as a queue-library re-evaluation, not a River extension.
The Same-Transaction Guarantee
This is the load-bearing structural change vs. v2.
Every CRUD handler — CreateJob, UpdateJob, PauseJob, ResumeJob, DeleteJob, DeleteJobsForProject — collapses into one Postgres transaction wrapped in restate.Run():
```go
_, err = restate.Run(ctx, func(rc restate.RunContext) (string, error) {
	return s.db.WithTx(rc.Context(), pgx.TxOptions{IsoLevel: pgx.Serializable}, func(tx pgx.Tx) (string, error) {
		// 1. Set RLS scope for this transaction.
		if _, err := tx.Exec(rc.Context(),
			"SELECT set_config('app.tenant_id', $1, true), set_config('app.project_id', $2, true)",
			tenantID, projectID); err != nil {
			return "", err
		}
		// 2. Quota check inside SERIALIZABLE — two parallel CreateJobs cannot
		//    both pass the count race; PG aborts one and Restate replays.
		if quotaExceeded(tx, tenantID, projectID, s.config) {
			return "", terminalResourceExhausted("schedule quota exceeded")
		}
		// 3. INSERT INTO scheduler.schedules — audit columns + auto_pause_threshold.
		if _, err := tx.Exec(rc.Context(), insertSchedule, args...); err != nil {
			return "", err
		}
		// 4. river.InsertTx in the SAME tx. UniqueOpts.ByArgs makes Restate replays safe.
		if _, err := s.river.InsertTx(rc.Context(), tx, ScheduledDispatchArgs{...},
			&river.InsertOpts{
				ScheduledAt: firstTick,
				MaxAttempts: 10,
				UniqueOpts:  river.UniqueOpts{ByArgs: true, ByPeriod: 5 * time.Minute},
			}); err != nil {
			return "", err
		}
		return scheduleID, nil
	})
})
```
If the handler crashes mid-transaction, Postgres rolls back atomically, Restate replays the run, and create runs again from scratch. If it commits and the handler crashes before Restate journals the result, the next replay's UniqueOpts.ByArgs makes the second River insert a no-op while the schedule INSERT collides on its primary key — both errors are non-terminal and the end state is identical. No reconciler is needed for create-time drift. This is the structural win over v2.
The audit-only sweeper survives, but its job description inverted:
| | v2 reconciler | v3 sweeper |
|---|---|---|
| Runs | Every 5 min | Nightly |
| Expected count | Non-zero (cleanup is its job) | Zero |
| Action on finding | Patch missing tasks, delete orphans | Page on-call — drift means a bug |
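One direction of the sweeper's diff can be sketched in SQL. This is a hypothetical query, not the PRD's: it assumes the dispatch job's args carry a `schedule_id` field (consistent with `UniqueOpts.ByArgs` deduping on `(schedule_id, scheduled_time)`) and reuses the `scheduled.dispatch` kind and job states that appear in the project-deletion SQL elsewhere on this page.

```sql
-- Audit sweep, one direction: active schedules with no pending river_job.
-- Expected row count: zero. Any row returned pages on-call.
SELECT s.schedule_id
FROM scheduler.schedules s
WHERE s.state = 'active'
  AND NOT EXISTS (
    SELECT 1
    FROM public.river_job j
    WHERE j.kind = 'scheduled.dispatch'
      AND j.state IN ('available', 'scheduled', 'retryable')
      AND (j.args->>'schedule_id') = s.schedule_id
  );
```

The opposite direction (pending jobs whose schedule is paused or deleted) is the mirror-image query.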
Execution Path
Why HTTP Dispatch (Not Direct Restate Calls)
Workers POST against arbitrary target URLs read from scheduler.schedules.target.url. The decoupling is deliberate:
- Decoupled. Scheduler doesn't import every target's proto. Adding a target = adding a URL to config.
- Same contract for every target. `HandleScheduled<X>` RPCs across NotificationGateway, LLMGateway, etc. share a body shape. Targets can be tested independently with any HTTP client.
- Internal-only in v1. No KrakenD in the execution path. Worker → Restate ingress is cluster-internal (port 8080). Trust boundary in v1 is Restate ingress itself; per-target auth is a Phase 6 concern for external HTTP targets.
Idempotency: Restate Ingress Does the Work
The dispatcher always sets:
```
Idempotency-Key: sched:<schedule_id>:<scheduled_time_unix_millis>
```
This is the Edge Idempotency Key PRD's Client-Generated pattern. Properties:
- Deterministic from `(schedule_id, scheduled_time)`. Restart-safe — every River retry, rescuer redelivery, or PG-blip retry computes the same key for the same logical execution.
- Distinct per cron tick. Tomorrow's 7am has a different `scheduled_time` → different key. No accidental cross-tick deduplication.
- Restate ingress dedupes natively. Target gateways that are Restate services get dedupe for free — the partition processor atomically records the key, caches the committed response (success and terminal error) for the configured retention window, and on a same-key repeat returns the cached response or attaches to the in-flight invocation.
Phase 2 (NotificationGateway) and Phase 3 (LLMGateway) acceptance criteria become: "service is registered with Restate; ingress idempotency retention is configured to ≥ retry window." That's it — no target-side dedup table, no middleware. The scheduler is the Edge Idempotency PRD's first active Client-Generated caller in production.
Retention as Load-Bearing Config
MaxAttempts: 10 and the default exponential backoff cap the total retry window at roughly 6 hours. Restate ingress idempotency retention must be ≥ this window (configured to 24h, ~4× safety margin) so that a retry hours after the original dispatch still hits the dedupe table.
A CI deploy-time check asserts this invariant before any worker rollout:
```
restate_ingress_idempotency_retention_seconds >= river_max_attempts × river_max_backoff_seconds
```
With v1 settings the assertion is 86400 >= 21600 — passes with 4× margin. The check fails the deploy if either side changes without the other being raised in lockstep. A periodic job emits scheduler.dispatch.idempotency_window_safety so the same invariant is alerted on at runtime, in case Restate retention is silently shifted by central config drift.
The doc-only safeguard the PRD originally proposed was deliberately rejected — silent misconfiguration of retention turns every PG-blip retry into a duplicate medication reminder, which is exactly the failure mode Idempotency-Key exists to prevent.
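The invariant is trivial to encode. A sketch of the deploy-time check, assuming the v1 numbers quoted above (`riverMaxBackoffSeconds` here is an illustrative constant chosen to reproduce the ~6h window; it is not River's actual backoff implementation):

```go
package main

import "fmt"

// Deploy-time invariant: Restate ingress idempotency retention must cover the
// full River retry window, or a late retry bypasses dedupe entirely.
const (
	restateRetentionSeconds = 86400 // 24h ingress idempotency retention (v1)
	riverMaxAttempts        = 10    // MaxAttempts from the worker config
	riverMaxBackoffSeconds  = 2160  // illustrative: yields the ~6h window in the text
)

func retentionCoversRetryWindow() bool {
	return restateRetentionSeconds >= riverMaxAttempts*riverMaxBackoffSeconds
}

func main() {
	window := riverMaxAttempts * riverMaxBackoffSeconds
	fmt.Printf("%d >= %d: %v\n", restateRetentionSeconds, window, retentionCoversRetryWindow())
}
```

In CI this would run before the worker rollout and fail the deploy on `false`; the same comparison, fed with live config, backs the `scheduler.dispatch.idempotency_window_safety` gauge.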
Silent-Drop Risks the Dispatcher Closes
Two places where a tick can fire but leave no record. The dispatcher closes both.
- PG transaction commit failure → execution row + counters lost, but HTTP side effect already happened. The dispatcher commits after the HTTP call returns. If the COMMIT fails (PG temporarily unavailable, conflict), River retries — which re-runs the HTTP dispatch. Safe only because Restate ingress dedupes on `Idempotency-Key`, so the duplicate dispatch is a no-op at the target. This dependency is load-bearing — losing it would mean every PG write blip becomes a duplicate side effect.
- Kafka emit failure when PG row already committed. Kafka emit happens after PG commits. If Kafka is unavailable, returning err makes River retry the whole `Work()` — re-dispatching HTTP and trying to insert another execution row. The duplicate INSERT must be caught by a dedupe mechanism on `(schedule_id, scheduled_time, attempt)`; the PRD captures three viable shapes (per-partition unique index, drop partitioning, or app-layer dedupe) — the implementer picks one in Phase 1.
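For concreteness, the per-partition unique index option might look like the following — a hypothetical sketch (partition and index names are illustrative; the PRD leaves the choice among the three shapes open):

```sql
-- Created on each monthly partition, so a retried Work() that re-inserts the
-- same (schedule_id, scheduled_time, attempt) fails fast instead of
-- double-counting the execution.
CREATE UNIQUE INDEX executions_2026_05_attempt_dedupe
  ON scheduler.executions_2026_05 (schedule_id, scheduled_time, attempt);
```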
Note: River has no equivalent of Asynq's RevokeTask (silent drop with no archive). river.JobCancel is the analog of SkipRetry (terminal, retained for forensics). One footgun fewer.
One Event Per Attempt vs. One Per Logical Execution
The dispatcher inserts an execution row on every attempt (with `attempt = job.Attempt`) and emits a Kafka event for every committed row. A 5xx-then-success produces two execution rows (one failed, `attempt=1`; one success, `attempt=2`) and two Kafka events. Consumers must filter:
| Consumer | Filter |
|---|---|
| Auto-pause | Look at most recent N rows for a schedule_id ORDER BY completed_at DESC. Don't react to a single mid-retry failure — the latest row's status determines counter reset. |
| Metering | Count only status='success' rows, or count (schedule_id, scheduled_time) groups (one per logical execution). |
Simpler than v2's "split into two topics" approach because the source of truth is the executions table — consumers can SQL-query it whenever they need ground truth, not just react to the Kafka stream.
Postgres Schema Highlights
Full DDL lives in the PRD. The shape worth knowing:
```sql
CREATE TABLE scheduler.schedules (
  schedule_id          TEXT PRIMARY KEY,
  tenant_id            TEXT NOT NULL,
  project_id           TEXT NOT NULL,
  owner_subject        TEXT NOT NULL,            -- 'user:alice'
  schedule_type        TEXT NOT NULL CHECK (schedule_type IN ('cron','once','recurring_interval')),
  cron_expression      TEXT,
  timezone             TEXT DEFAULT 'UTC',
  target               JSONB NOT NULL,           -- {url, method, kind, payload}
  state                TEXT NOT NULL DEFAULT 'active' CHECK (state IN ('active','paused','deleted')),
  auto_pause_threshold INTEGER NOT NULL DEFAULT 10
    CHECK (auto_pause_threshold = 0 OR auto_pause_threshold BETWEEN 3 AND 100),
  created_by_subject   TEXT NOT NULL,
  paused_by_subject    TEXT, paused_reason TEXT, paused_at TIMESTAMPTZ,
  -- (resumed_/deleted_/updated_ pairs as well)
  next_trigger_at      TIMESTAMPTZ,
  trigger_count        BIGINT NOT NULL DEFAULT 0,
  failure_count        BIGINT NOT NULL DEFAULT 0
);

CREATE TABLE scheduler.executions (
  execution_id   BIGSERIAL PRIMARY KEY,
  schedule_id    TEXT NOT NULL REFERENCES scheduler.schedules(schedule_id) ON DELETE CASCADE,
  tenant_id      TEXT NOT NULL,                  -- denormalized for query perf
  project_id     TEXT NOT NULL,
  scheduled_time TIMESTAMPTZ NOT NULL,
  started_at     TIMESTAMPTZ NOT NULL,
  completed_at   TIMESTAMPTZ NOT NULL,
  status         TEXT NOT NULL CHECK (status IN ('success','failed','malformed')),
  http_status    INTEGER,
  attempt        INTEGER NOT NULL,
  duration_ms    INTEGER NOT NULL,
  target_kind    TEXT NOT NULL,
  error          TEXT,
  river_job_id   BIGINT
) PARTITION BY RANGE (completed_at);
```
Notes:
- RLS enforced on both tables. SchedulerService sets `app.tenant_id` and `app.project_id` per-transaction via `set_config()`; cross-tenant ops use a dedicated `BYPASSRLS` role, never the application role. RLS is defense-in-depth on top of OpenFGA, not a replacement.
- `scheduler.executions` is range-partitioned monthly via `pg_partman` with 90-day default retention (configurable per-tenant later). Partition drop is metadata-only — constant-cost cleanup regardless of execution volume.
- Audit columns are denormalized — only the latest `paused_by_subject` is kept, not full pause/resume history. Engineering forensics rely on Loki structured logs; tenant-facing audit history will arrive when Retraced lands platform-wide (Phase 5 — ~1-2 days of `Retraced.Publish()` plumbing, no schema change).
Schedule Types Map to River
| Type | Fields used | River mapping |
|---|---|---|
| cron | cron_expression, timezone | Registered as a river.PeriodicJob at worker startup. Leader-elected periodic_job_enqueuer inserts a fresh river_job row per tick. |
| once | scheduled_at | Single river.Client.InsertTx(args, InsertOpts{ScheduledAt: scheduled_at}) at create time. |
| recurring_interval | interval_seconds | Worker computes next tick at end of dispatch and inserts a new River job in a separate transaction after the execution row commits. On worker crash mid-window, River retries the whole Work(); Restate's Idempotency-Key dedupes the duplicate dispatch and the next-tick insert is re-attempted. |
Cron Edits Land at Next Worker Deploy
UpdateJob to cron_expression takes effect when loadPeriodicJobsFromDB runs — i.e., next worker Deployment restart. No hot-reload in v1. Hot-reload (LISTEN/NOTIFY watcher + PeriodicJob re-registration) is ~3-5 days of work for a non-load-bearing UX feature; deferred until cron edit volume becomes a recurring support question.
The documented immediate-effect workaround — both in API docs and in the UpdateJob response itself — is PauseJob on the existing schedule + CreateJob for a new one with the desired cron. The UpdateJob response surfaces effective_at: "next_worker_deployment" so consuming UIs can communicate this clearly.
Multi-Tenancy & Quotas
| Quota | Default | Enforcement |
|---|---|---|
| max_schedules_per_project | 500 | CreateJob returns RESOURCE_EXHAUSTED. Counted via RLS-scoped query inside the SERIALIZABLE transaction to prevent race conditions. |
| max_schedules_per_user | 50 | Limits individual user's schedules within a project. |
| min_schedule_interval | 60s | Reject cron more frequent than 1/min — River poll cadence is 1s but sub-minute precision is not a stable guarantee. |
Project deletion runs in a single Postgres transaction:
```sql
-- 1. Cancel pending River jobs for this project
UPDATE public.river_job SET state = 'cancelled', finalized_at = now()
WHERE kind = 'scheduled.dispatch' AND state IN ('available','scheduled','retryable')
  AND (args->>'tenant_id') = $1 AND (args->>'project_id') = $2;

-- 2. Soft-delete schedule rows
UPDATE scheduler.schedules SET state = 'deleted', updated_at = now()
WHERE tenant_id = $1 AND project_id = $2 AND state != 'deleted';
```
Both steps share a transaction, so partial deletion is impossible.
Auto-Pause is Per-Schedule, Configurable
Different schedules in the same project legitimately want different thresholds (medication reminders vs. marketing digests). Stored on scheduler.schedules.auto_pause_threshold:
- `0` — disabled (manual pause only)
- `3..100` — valid range
- `10` — default
Per-tenant defaults are not built in v1. Add only if a tenant explicitly asks. The auto-pause Kafka consumer reads the threshold per-schedule and filters on (attempt, status) to avoid reacting to mid-retry failures — only the final failure of a fully-retried dispatch counts toward the consecutive-failure counter.
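The counter logic the consumer applies can be sketched in a few lines. Names are illustrative; the rows are assumed pre-filtered to final attempts only (mid-retry failures excluded), per the filtering rule above:

```go
package main

import "fmt"

// consecutiveFailures walks final-attempt statuses newest-first and counts
// failures until the first non-failure — the "latest row's status determines
// counter reset" rule.
func consecutiveFailures(newestFirstStatuses []string) int {
	n := 0
	for _, s := range newestFirstStatuses {
		if s != "failed" {
			break // any success resets the streak
		}
		n++
	}
	return n
}

// shouldAutoPause applies the per-schedule threshold; 0 disables auto-pause.
func shouldAutoPause(newestFirstStatuses []string, threshold int) bool {
	if threshold == 0 {
		return false
	}
	return consecutiveFailures(newestFirstStatuses) >= threshold
}

func main() {
	fmt.Println(shouldAutoPause([]string{"failed", "failed", "failed"}, 3))  // true
	fmt.Println(shouldAutoPause([]string{"failed", "success", "failed"}, 3)) // false — streak broken
	fmt.Println(shouldAutoPause([]string{"failed", "failed", "failed"}, 0))  // false — disabled
}
```

On a `true` result the consumer would call `PauseJob(reason="auto:consecutive_failures")`, which lands in `paused_reason` as described in the glossary.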
Security — Restate Ingress Is the Trust Boundary
In v1, all dispatch targets are Restate services reached via Restate ingress. The trust boundary is Restate ingress itself, not application-level signatures:
- Worker → Restate ingress is internal cluster traffic (port 8080, no KrakenD in this path).
- Restate ingress validates the request shape and routes to the registered target service.
- Restate ingress's native
Idempotency-Keydedup prevents replay-based attacks within the retention window.
No HMAC signing in v1. The PRD's earlier draft called for X-Schedule-Signature; it was dropped because no v1 target verifies it (all v1 targets rely on ingress trust) and operational cost of OpenBao key rotation was non-trivial for a 2-person team. HMAC re-enters as one option in target.auth.kind for external HTTP targets if Phase 6 is ever picked up — alongside bearer, oauth2_client_credentials, and mtls.
On-Behalf-Of Authorization
Scheduled actions use permission intersection: service:scheduler scope AND the owner_subject's current permissions. If the original user loses project access, their schedules fail at execution time — the worker's RLS-scoped target read returns no row and the dispatch is JobCancel'd.
Component Inventory
New / Extended Components
| Component | Purpose | PRD section |
|---|---|---|
| services/scheduler/v1/ | New Restate service. CRUD RPCs (CreateJob, GetJob, ListJobs, UpdateJob, PauseJob, ResumeJob, DeleteJob, DeleteJobsForProject, ListExecutions). | §SchedulerService API |
| Cloud SQL EP + regional HA + Managed Connection Pooling | Source of truth for everything. Provisioned by platform team. Phase 1 hard requirement. | §Phase 1 |
| scheduler.schedules migration | Schedule definitions + denormalized audit columns + auto_pause_threshold. | §Postgres Schema |
| scheduler.executions migration | Range-partitioned via pg_partman. | §Postgres Schema |
| RLS policies | tenant_id + project_id isolation; per-tx via SET LOCAL. | §Multi-Tenancy via Row-Level Security |
| River library + River's own migrations | Queue, cron, lease, rescuer. Migrations applied at deploy time. | §River Worker — HTTP Dispatcher |
| River Worker (HTTPDispatcher) | Standalone K8s Deployment, one fleet per target_kind. | §River Worker |
| CI deploy-time idempotency-retention check | Asserts Restate retention >= MaxAttempts × max-backoff before any worker rollout. | §Idempotency Retention as Load-Bearing Config |
| Audit-only reconciliation cron | Nightly sweeper, expected-zero. | §Reconciliation — Audit-Only |
| KrakenD /scheduler/* routes | Scheduler team adds directly (2-person team — no formal cross-team handoff). | §Phase 1 |
| Auto-pause Kafka consumer | Reads auto_pause_threshold per schedule; calls PauseJob(reason="auto:consecutive_failures"). | §Phase 4, §Auto-Pause |
| Metering Kafka consumer | Per-execution Lago events; dedups on (schedule_id, scheduled_time). | §Phase 4 |
Integrated Components (No Changes Required)
| Component | Role |
|---|---|
| Restate Go SDK | restate.Run() for journaling the create transaction; restate.WithIdempotencyKey is unused — the dispatcher sets the canonical IETF header directly. |
| Restate ingress | Native Idempotency-Key dedup on the partition processor. Handles cache hit on retries automatically. |
| pkg/tenancy | TenantFromContext / ProjectFromContext for setting RLS scope. |
| OpenFGA | Authorization for CRUD: member for read, admin for write/delete. RLS is defense-in-depth on top, not a replacement. |
| Loki | Structured-log substrate for engineering forensics; pre-Retraced audit evidence trail. |
Observability Additions
Metrics emitted from SchedulerService and the Worker. Labels include tenant_id + project_id so per-project dashboards work day one.
| Metric | Type | Description |
|---|---|---|
| scheduler.schedules.total | Gauge | Schedule count by state, target_kind, project |
| scheduler.dispatch.total | Counter | Dispatch invocations by target_kind, result |
| scheduler.dispatch.duration_ms | Histogram | Dispatch latency |
| scheduler.dispatch.retries | Counter | River retries by attempt number |
| scheduler.dispatch.idempotency_window_safety | Gauge | Runtime check against the retention invariant |
| scheduler.queue.available | Gauge | River pending jobs by queue (drives KEDA HPA) |
| scheduler.queue.scheduled | Gauge | River future-scheduled jobs by queue |
| scheduler.audit.findings | Counter | Reconciliation sweeper findings — expected zero |
| pg.connection_pool.in_use | Gauge | Cloud SQL pooler — connections in use |
| pg.replication.lag_seconds | Gauge | HA replica lag (alert if > 10s) |
Alerts at runtime: pending queue depth > 1000 for 5m (warn), dispatch error rate > 10% for 5m (page), reconciliation finding any (page), Cloud SQL HA replica lag > 10s for 1m (warn), connection pool exhaustion any (page), River rescuer requeues > 50/hr (investigate).
Rollout Phases
The 5-phase plan from the PRD, plus a deliberately-not-on-roadmap Phase 6 placeholder. Phase 1 timeline depends on platform-team capacity to provision Cloud SQL — confirm provisioning lead time before committing to the 2-week estimate.
| Phase | Scope | Estimate | Status |
|---|---|---|---|
| 1. SchedulerService + Cloud SQL + River | Cloud SQL EP + HA + Managed Pooling provisioned by platform team (smallest class, scale reactively); migrations for scheduler.schedules (with audit columns + auto_pause_threshold), scheduler.executions (pg_partman), RLS policies, River's own migrations; CRUD subset (CreateJob, GetJob, ListJobs, DeleteJob) with one-PG-tx + SERIALIZABLE; River Worker Deployment per target_kind, MaxAttempts=10, ~6h retry window, canonical Idempotency-Key header; CI deploy-time check (retention ≥ retry window); KrakenD /scheduler/* routes; River UI; audit-only reconciliation cron. No HMAC in Phase 1. | 2 weeks | Not started |
| 2. NotificationGateway target | HandleScheduledNotification RPC; configure Restate ingress retention ≥ retry window; Novu dispatch; end-to-end test with duplicate-Idempotency-Key dedup verification. No application-level dedup middleware required. | 1 week | Not started |
| 3. LLMGateway target | HandleScheduledMessage RPC; same idempotency story as Phase 2; message injection into conversation thread; end-to-end test. | 1 week | Not started |
| 4. Execution history + full CRUD | Auto-pause Kafka consumer (per-schedule auto_pause_threshold, default 10, range 0/3-100); metering Kafka consumer (Lago); ListExecutions (RLS-scoped queries); UpdateJob (with effective_at: "next_worker_deployment" response field), PauseJob, ResumeJob, DeleteJobsForProject. | 1 week | Not started |
| 5. Maintenance + Retraced | OTel + Grafana dashboard (Cloud SQL + River + dispatch, including the safety-margin metric); worker drain procedure; PG minor-version upgrade runbook (sub-second blip on EP+HA); PG major-version upgrade via DMS-managed cutover (~30s blackout, ~5y cadence); load test against realistic Cloud SQL sizing; Retraced integration (~1-2 days Retraced.Publish() plumbing in CRUD handlers) when the platform-wide rollout lands. | — | Not started — Retraced timing depends on platform |
Phase Dependencies
| Phase depends on | Reason |
|---|---|
| Phase 1 depends on platform team's Cloud SQL provisioning | Phase 1 cannot start without the instance. Confirm lead time before committing to the 2-week estimate. |
| Phase 1 depends on Projects PRD | RLS scope tenant_id + project_id is established by Projects. |
| Phase 1 depends on Edge Idempotency PRD | Restate ingress dedup is the v1 trust+correctness boundary for retries. |
| Phases 2 and 3 can run in parallel | Both consume Phase 1's substrate; share the same idempotency story; ordering between them is a team-capacity decision, not a dependency. |
| Phase 4 depends on Phases 2/3 | Auto-pause + metering need real targets to observe; ListExecutions is most useful when there are executions to list. |
| Phase 5 depends only on Phase 1 | Observability + runbooks + load test are about the substrate, not the tenants. Retraced lands separately based on platform availability. |
| Phase 6 has no dependency arrow into 1-5 | Phase 1 deliberately invests no scaffolding for it (see Phase 6 placeholder). |
Future: Phase 6 (External HTTP Targets) — Placeholder Only
Not committed. No customer ask, no dated roadmap entry. The section exists in the PRD so the design space is visible — i.e., if/when external webhook scheduling is requested, the extension shape is clear and the v1 architecture doesn't need to change.
If picked up, the additions are well-scoped:
- Per-target auth on the schedule doc — `target.auth.kind` ∈ `{none, hmac, bearer, oauth2_client_credentials, mtls}` with `secret_ref` to OpenBao.
- `target.headers` map for per-schedule custom headers; forbidden headers (`Idempotency-Key`, `X-Schedule-*`, `Host`) stripped or rejected at `CreateJob`.
- Per-tenant egress allowlist. `target.url` host must match a configured allowlist — without this, `CreateJob` is an SSRF / data-exfiltration primitive. SOC2 hard requirement.
- Idempotency becomes the target's problem. For non-Restate targets, `Idempotency-Key` is advisory — the worker still sends it, but the target may or may not honor it. At-least-once duplicates from worker crashes/retries are the contract.
- Per-target rate limiting. Trivial extension of queue partitioning — one queue per target host.
When/if this phase is picked up, expect roughly +50 lines of PRD, additive schema fields on target, and an admin-time allowlist policy. None of the cron / delivery / scaling decisions in this PRD change.
Out of Scope for v1
- Multi-region active-active scheduling. Single-region single-cluster is an explicit accepted constraint (River is Postgres-only by design — see riverqueue/river#104). Tenant sharding is the canonical horizontal-scale path beyond single-instance vertical scale.
- Hot-reload of cron schedule edits. `UpdateJob` to `cron_expression` takes effect at next worker Deployment restart. Documented Pause+Create-new workaround for immediate-effect needs.
- HMAC signing/verification on dispatched HTTP requests. Restate ingress is the v1 trust boundary; HMAC re-enters as one option in `target.auth.kind` if Phase 6 is picked up.
- Tenant-facing audit log UI / append-only audit-events table. Phase 1 is denormalized columns + Loki only. Retraced lands in Phase 5; full event-log auditing arrives with it. Pre-Retraced SOC2 gap is an accepted risk; if Retraced timeline slips materially, revisit.
- Per-tenant default `auto_pause_threshold`. Per-schedule only in v1.
- Per-tenant Cloud SQL instances. One scheduler-scoped instance for v1; tenant sharding only when a single primary saturates.
- Broader Firestore → Postgres migration. Cloud SQL is provisioned because River requires it. Other domains stay on Firestore unless they make a separate per-domain decision.
- External HTTP destinations. Phase 6 placeholder — see above.
- `_platform`-scoped schedules. No org-wide schedules in v1. Lands with the first `_platform` consumer's PRD.
Cross-References
- Scheduled Jobs PRD v3 — the v3 design in full, including DDL, CRUD flow, dispatcher, scaling levers, and open questions
- Edge Idempotency Roadmap — sibling roadmap; this design is the first active Client-Generated caller of the canonical `Idempotency-Key` header
- Tenant & Project Lifecycle Roadmap — multi-tenant substrate the RLS scope is built on
- GCP Eventarc → Kafka Pipeline Roadmap — sibling pattern for async GCP-originated events; explains why scheduler dispatches deliberately do not flow through Kafka
- RequestContext Refactor Roadmap — typed-subject and protovalidate guarantees `owner_subject` and audit columns inherit
- River library docs — particularly Periodic Jobs, Unique Jobs, Migrations
- Cloud SQL Enterprise Plus — near-zero-downtime maintenance tier used by this design
- Cloud SQL Managed Connection Pooling — Phase 1 hard requirement
- `pg_partman` — partition management for `executions` retention
- IETF draft-ietf-httpapi-idempotency-key-header-07 — the standard the dispatcher conforms to