Scheduled Jobs Roadmap
End-to-end view of how the platform builds a scheduler primitive — proactive LLM check-ins, medication reminders, digest emails, metering rollups, or any cron-driven work — on top of River and Cloud SQL Postgres. This page consolidates the Scheduled Jobs PRD (v3 — River + Cloud SQL, supersedes v2 Asynq+Firestore and v0.4 GCP Cloud Scheduler) and its dependencies into a single reference. The design is complete; the 5-phase rollout has not started.
This page is derived from the v3 PRD and its closely related dependencies:
Primary (docs/prd/):
- Scheduled Jobs — the v3 design (2026-04-30): River + Cloud SQL, same-transaction CreateJob, audit-only reconciliation, Restate ingress as Phase 1 trust boundary, 5-phase rollout (+ a deliberately-not-on-roadmap Phase 6 placeholder)
Dependencies:
- Edge Idempotency Key — canonical `Idempotency-Key` semantics. The dispatcher is the first active Client-Generated caller of that pattern: every HTTP dispatch carries `Idempotency-Key: sched:<schedule_id>:<scheduled_time_unix_millis>` and Restate ingress dedupes natively
- Authorization PRD — typed-subject format used by `owner_subject` and the audit columns
- Projects PRD — `tenant_id` + `project_id` columns and Postgres RLS policies isolate schedules per project
- Production Infrastructure PRD — Cloud SQL Enterprise Plus + regional HA + GCP Managed Connection Pooling are Phase 1 hard requirements
Related:
- Tenant & Project Lifecycle Roadmap — multi-tenant substrate (`tenants/{tid}/projects/{pid}/...`) the RLS scope is built on
- GCP Eventarc → Kafka Pipeline Roadmap — sibling pattern for async GCP-originated events; explains why scheduler dispatches deliberately do not flow through Kafka
- Idempotency Roadmap — sibling roadmap; this design's dispatcher consumes the Edge Idempotency primitive end-to-end
The whole design refuses to operate two storage systems for one logical thing. Schedules and the queue rows that drive them go into the same Postgres transaction — drift between schedule state and queue state becomes impossible by construction, and the periodic reconciler that v2 needed to band-aid Firestore↔Redis drift collapses into an audit-only sweeper that pages on any non-zero finding.
- Cloud SQL is the source of truth for everything: schedule docs (`scheduler.schedules`), queue rows (`river_job` — owned by the River library), and execution history (`scheduler.executions`, range-partitioned monthly via `pg_partman`). One store, one durability story.
- River is the execution engine — Postgres-backed Go job queue with native cron via `PeriodicJob`, leader-elected enqueueing via Postgres advisory lock, native lease + rescuer recovery, `UniqueOpts` for per-tick dedup. Replaces v2's self-re-enqueue dance (workarounds for `asynq.Scheduler` bugs) with library-native primitives.
- HTTP dispatch via Restate ingress with the canonical IETF `Idempotency-Key` header — Restate's partition processor dedupes natively, so target gateways inherit at-most-once side effects without writing application-level middleware.
- Kafka is a reactive bus, not a projection pipeline. Execution rows live in Postgres; the `scheduler.executions` Kafka topic exists only for downstream consumers (auto-pause, metering). If Kafka is unavailable, execution data is durable in PG and missed events can be replayed. Kafka is not on the durability path.
Canonical reference: Scheduled Jobs PRD v3. The v0.4 (GCP Cloud Scheduler) and v2 (Asynq + Firestore) designs are superseded and preserved only for git-history forensics.
Glossary
| Term | Definition |
|---|---|
| SchedulerService | Restate service. CRUD over schedules. Writes scheduler.schedules and river_job rows in one Postgres transaction wrapped in restate.Run() for journaling. |
| River | Postgres-backed Go job queue (riverqueue.com). Provides LISTEN/NOTIFY + 1s poll dispatch, lease + rescuer recovery, native PeriodicJob for cron, UniqueOpts for per-tick dedup, and an Asynqmon-equivalent admin UI. |
| River Worker | Standalone Go K8s Deployment. One fleet per target_kind (notification, llm, metering, …) for queue isolation. Runs river.Client.Start() to claim jobs and dispatch HTTP. |
| Same-transaction guarantee | Schedule INSERT + river.InsertTx() share a single Postgres transaction at SERIALIZABLE isolation. Either both rows commit or neither does. Drift between schedule state and queue state is structurally impossible — the v2 reconciler is gone. |
| Audit-only reconciliation | Nightly sweeper that diffs scheduler.schedules against river_job looking for active schedules with no future job (or vice versa). Expected count: zero. Any finding pages on-call — drift means a bug we want to know about, not routine cleanup. |
| PeriodicJob | River's native cron primitive. The leader-elected periodic_job_enqueuer (singleton across the fleet via Postgres advisory lock) inserts a fresh river_job row at each cron tick. Per-tick identity is automatic — no *Task/TaskID collisions like asynq. |
| UniqueOpts.ByArgs | River feature that dedupes job inserts by hash of args. Used to make restate.Run() replays safe — a second InsertTx for the same (schedule_id, scheduled_time) is a no-op. |
| Idempotency-Key | The IETF canonical header. The dispatcher always sets Idempotency-Key: sched:<schedule_id>:<scheduled_time_unix_millis> — deterministic, restart-safe, distinct per cron tick. Restate ingress dedupes natively; target gateways inherit safe retries. |
| auto_pause_threshold | Per-schedule integer; range 0 (disabled) or 3..100 (default 10). After N consecutive failed dispatches, the auto-pause Kafka consumer calls PauseJob(reason="auto:consecutive_failures"). The reason lands in paused_reason so tenants distinguish auto-pause from manual pause. |
| Retention as load-bearing config | Restate ingress idempotency retention (24h) MUST be ≥ River MaxAttempts × max-backoff (~6h). A CI deploy-time check fails the deploy if either side changes. The runtime metric scheduler.dispatch.idempotency_window_safety alerts if retention silently shrinks. |
| _platform sentinel | Reserved value of tenant_id / project_id for org-wide schedules (none in v1). Out of scope until first _platform-scoped schedule's PRD lands. |
Why v3 — What Got Rewritten
Two previous designs made it to full PRDs; both were deliberately superseded.
| Version | Substrate | Why retired |
|---|---|---|
| v0.4 | GCP Cloud Scheduler | Hard 500-jobs-per-project ceiling. ~$1,000/mo at 10k schedules. Per-tenant project sharding would have meant building the abstraction we needed inside our application anyway. |
| v2 | Asynq + Firestore + Kafka projection | Two stores that could drift → load-bearing reconciler every 5 min. asynq.Scheduler reused the same *Task/TaskID across ticks (collisions from tick #2 onward). Self-re-enqueue dance worked around the bug but was a canary for "fighting the library." Scheduler restart lost in-memory cron registrations. Redis without AOF = at-most-once. |
| v3 (current) | River + Cloud SQL | Native PeriodicJob (no self-re-enqueue). Same Postgres transaction for schedule + queue (no reconciler). PG durability inherited from Cloud SQL EP + HA. LISTEN/NOTIFY wakes workers in ~ms. Tens of thousands of writes/sec headroom on a single instance. |
The cost shape changed too: ~$300/mo at 10k schedules for Cloud SQL EP + HA + Managed Pooling, CTO-approved as a line item. Phase 1 starts on the smallest EP class and scales reactively based on monitoring — no headroom pre-paid for hypothetical future domains.
Cloud SQL Is Pragmatic, Not a Migration Commitment
This Cloud SQL instance is provisioned because River requires Postgres. The decision is scheduler-scoped — the PRD does not commit to migrating other domains away from Firestore. If/when a second domain wants Postgres, instance topology (shared vs. per-domain) is a fresh decision at that time. The schedule schema is namespaced under scheduler.* so the instance can be repurposed cleanly if needed, but no part of the design depends on that.
Ownership: platform team owns the instance end-to-end (provisioning, HA topology, Managed Pooling config, backup verification, restore tests, on-call, capacity). Scheduler team owns schema definitions, query patterns, RLS policies, and application-level tuning.
Architecture
The system has two paths sharing one Postgres instance: a synchronous CRUD path (Client → KrakenD → SchedulerService → PG) and an asynchronous execution path (River → Worker → Restate → Target). Kafka sits to the side, consuming PG-committed execution events for auto-pause and metering.
The single CRUD seam is SchedulerService. Workers never write to scheduler.schedules (except the counter UPDATE inside the dispatch transaction); CRUD never enqueues River jobs except via river.InsertTx inside the create transaction. The two write paths are kept disjoint deliberately so that "who wrote this row" is always answerable by reading the row.
Component Roles
| Component | Responsibility | Owner |
|---|---|---|
| SchedulerService | Restate service. CRUD RPCs. Writes scheduler.schedules + river_job rows in one transaction. Owns tenant isolation (RLS), per-project quotas, audit-column population. | Scheduler team |
| Cloud SQL Postgres | Source of truth for schedules, queue state, execution history. EP + regional HA + Managed Connection Pooling — Phase 1 hard requirements. | Platform team (instance / ops) + Scheduler team (schema / RLS / migrations) |
| River library | Postgres-backed queue. LISTEN/NOTIFY + 1s poll. PeriodicJob for cron. Lease + rescuer for crash recovery. River UI for ops visibility. | Library — vendored |
| River Worker | Standalone K8s Deployment, one fleet per target_kind. HTTPDispatcher.Work runs the request, writes the execution row, emits Kafka, returns. | Scheduler team |
| Target Gateway | Any Restate gateway (NotificationGateway, LLMGateway, …). One new RPC per gateway: HandleScheduled<X>. Idempotency dedup is inherited from Restate ingress — no application-level middleware. | Owning gateway team |
| Kafka topic scheduler.executions | Reactive event bus. Auto-pause and metering subscribe. Not on the durability path. | Scheduler team |
Postgres as the Ceiling
Tens of thousands of writes/sec on a well-provisioned Cloud SQL EP instance — far above any current scheduler load profile. The bottlenecks come in this order:
- Connection pool exhaustion (managed pooler config).
- Vertical Cloud SQL scale. Up to 96 vCPU / 624 GB RAM on EP.
- Read replicas for `executions` queries. Hot write path is `river_job` + `executions` inserts; admin/dashboard queries on `executions` offload to a replica.
- Tenant sharding. When a single primary saturates, run multiple instances and shard by `tenant_id` at SchedulerService insert time. This is the canonical horizontal-scale path.
Multi-region active-active is explicitly out of scope — River maintainers have rejected distributed-SQL backends in riverqueue/river#104. If multi-region active-active ever lands on the roadmap, treat it as a queue-library re-evaluation, not a River extension.
The Same-Transaction Guarantee
This is the load-bearing structural change vs. v2.
Every CRUD handler — CreateJob, UpdateJob, PauseJob, ResumeJob, DeleteJob, DeleteJobsForProject — collapses into one Postgres transaction wrapped in restate.Run():
```go
_, err = restate.Run(ctx, func(rc restate.RunContext) (string, error) {
	return s.db.WithTx(rc.Context(), pgx.TxOptions{IsoLevel: pgx.Serializable}, func(tx pgx.Tx) (string, error) {
		// 1. Set RLS scope for this transaction.
		if _, err := tx.Exec(rc.Context(),
			"SELECT set_config('app.tenant_id', $1, true), set_config('app.project_id', $2, true)",
			tenantID, projectID); err != nil {
			return "", err
		}
		// 2. Quota check inside SERIALIZABLE — two parallel CreateJobs cannot
		//    both pass the count race; PG aborts one and Restate replays.
		if quotaExceeded(tx, tenantID, projectID, s.config) {
			return "", terminalResourceExhausted("schedule quota exceeded")
		}
		// 3. INSERT INTO scheduler.schedules — audit columns + auto_pause_threshold.
		if _, err := tx.Exec(rc.Context(), insertSchedule, args...); err != nil {
			return "", err
		}
		// 4. river.InsertTx in the SAME tx. UniqueOpts.ByArgs makes Restate replays safe.
		if _, err := s.river.InsertTx(rc.Context(), tx, ScheduledDispatchArgs{...},
			&river.InsertOpts{
				ScheduledAt: firstTick,
				MaxAttempts: 10,
				UniqueOpts:  river.UniqueOpts{ByArgs: true, ByPeriod: 5 * time.Minute},
			}); err != nil {
			return "", err
		}
		return scheduleID, nil
	})
})
```
If the handler crashes mid-transaction, Postgres rolls back atomically, Restate replays the run, and create runs again from scratch. If it commits and the handler crashes before Restate journals the result, the next replay's UniqueOpts.ByArgs makes the second River insert a no-op while the schedule INSERT collides on its primary key — both errors are non-terminal and the end state is identical. No reconciler is needed for create-time drift. This is the structural win over v2.
The audit-only sweeper survives, but its job description inverted:
| | v2 reconciler | v3 sweeper |
|---|---|---|
| Runs | Every 5 min | Nightly |
| Expected count | Non-zero (cleanup is its job) | Zero |
| Action on finding | Patch missing tasks, delete orphans | Page on-call — drift means a bug |
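One direction of the sweeper's diff can be sketched in SQL. This is a hypothetical query, not the PRD's: it assumes the dispatch job's args carry a `schedule_id` field (consistent with `UniqueOpts.ByArgs` deduping on `(schedule_id, scheduled_time)`) and reuses the `scheduled.dispatch` kind and job states that appear in the project-deletion SQL elsewhere on this page.

```sql
-- Audit sweep, one direction: active schedules with no pending river_job.
-- Expected row count: zero. Any row returned pages on-call.
SELECT s.schedule_id
FROM scheduler.schedules s
WHERE s.state = 'active'
  AND NOT EXISTS (
    SELECT 1
    FROM public.river_job j
    WHERE j.kind = 'scheduled.dispatch'
      AND j.state IN ('available', 'scheduled', 'retryable')
      AND (j.args->>'schedule_id') = s.schedule_id
  );
```

The opposite direction (pending jobs whose schedule is paused or deleted) is the mirror-image query.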
Execution Path
Why HTTP Dispatch (Not Direct Restate Calls)
Workers POST against arbitrary target URLs read from scheduler.schedules.target.url. The decoupling is deliberate:
- Decoupled. Scheduler doesn't import every target's proto. Adding a target = adding a URL to config.
- Same contract for every target. `HandleScheduled<X>` RPCs across NotificationGateway, LLMGateway, etc. share a body shape. Targets can be tested independently with any HTTP client.
- Internal-only in v1. No KrakenD in the execution path. Worker → Restate ingress is cluster-internal (port 8080). Trust boundary in v1 is Restate ingress itself; per-target auth is a Phase 6 concern for external HTTP targets.
Idempotency: Restate Ingress Does the Work
The dispatcher always sets:
```
Idempotency-Key: sched:<schedule_id>:<scheduled_time_unix_millis>
```
This is the Edge Idempotency Key PRD's Client-Generated pattern. Properties:
- Deterministic from `(schedule_id, scheduled_time)`. Restart-safe — every River retry, rescuer redelivery, or PG-blip retry computes the same key for the same logical execution.
- Distinct per cron tick. Tomorrow's 7am has a different `scheduled_time` → different key. No accidental cross-tick deduplication.
- Restate ingress dedupes natively. Target gateways that are Restate services get dedupe for free — the partition processor atomically records the key, caches the committed response (success and terminal error) for the configured retention window, and on a same-key repeat returns the cached response or attaches to the in-flight invocation.
Phase 2 (NotificationGateway) and Phase 3 (LLMGateway) acceptance criteria become: "service is registered with Restate; ingress idempotency retention is configured to ≥ retry window." That's it — no target-side dedup table, no middleware. The scheduler is the Edge Idempotency PRD's first active Client-Generated caller in production.
Retention as Load-Bearing Config
MaxAttempts: 10 and the default exponential backoff cap the total retry window at roughly 6 hours. Restate ingress idempotency retention must be ≥ this window (configured to 24h, ~4× safety margin) so that a retry hours after the original dispatch still hits the dedupe table.
A CI deploy-time check asserts this invariant before any worker rollout:
```
restate_ingress_idempotency_retention_seconds >= river_max_attempts × river_max_backoff_seconds
```
With v1 settings the assertion is 86400 >= 21600 — passes with 4× margin. The check fails the deploy if either side changes without the other being raised in lockstep. A periodic job emits scheduler.dispatch.idempotency_window_safety so the same invariant is alerted on at runtime, in case Restate retention is silently shifted by central config drift.
The doc-only safeguard the PRD originally proposed was deliberately rejected — silent misconfiguration of retention turns every PG-blip retry into a duplicate medication reminder, which is exactly the failure mode Idempotency-Key exists to prevent.
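The invariant is trivial to encode. A sketch of the deploy-time check, assuming the v1 numbers quoted above (`riverMaxBackoffSeconds` here is an illustrative constant chosen to reproduce the ~6h window; it is not River's actual backoff implementation):

```go
package main

import "fmt"

// Deploy-time invariant: Restate ingress idempotency retention must cover the
// full River retry window, or a late retry bypasses dedupe entirely.
const (
	restateRetentionSeconds = 86400 // 24h ingress idempotency retention (v1)
	riverMaxAttempts        = 10    // MaxAttempts from the worker config
	riverMaxBackoffSeconds  = 2160  // illustrative: yields the ~6h window in the text
)

func retentionCoversRetryWindow() bool {
	return restateRetentionSeconds >= riverMaxAttempts*riverMaxBackoffSeconds
}

func main() {
	window := riverMaxAttempts * riverMaxBackoffSeconds
	fmt.Printf("%d >= %d: %v\n", restateRetentionSeconds, window, retentionCoversRetryWindow())
}
```

In CI this would run before the worker rollout and fail the deploy on `false`; the same comparison, fed with live config, backs the `scheduler.dispatch.idempotency_window_safety` gauge.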
Silent-Drop Risks the Dispatcher Closes
Two places where a tick can fire but leave no record. The dispatcher closes both.
- PG transaction commit failure → execution row + counters lost, but HTTP side effect already happened. The dispatcher commits after the HTTP call returns. If the COMMIT fails (PG temporarily unavailable, conflict), River retries — which re-runs the HTTP dispatch. Safe only because Restate ingress dedupes on `Idempotency-Key`, so the duplicate dispatch is a no-op at the target. This dependency is load-bearing — losing it would mean every PG write blip becomes a duplicate side effect.
- Kafka emit failure when PG row already committed. Kafka emit happens after PG commits. If Kafka is unavailable, returning err makes River retry the whole `Work()` — re-dispatching HTTP and trying to insert another execution row. The duplicate INSERT must be caught by a dedupe mechanism on `(schedule_id, scheduled_time, attempt)`; the PRD captures three viable shapes (per-partition unique index, drop partitioning, or app-layer dedupe) — the implementer picks one in Phase 1.
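For concreteness, the per-partition unique index option might look like the following — a hypothetical sketch (partition and index names are illustrative; the PRD leaves the choice among the three shapes open):

```sql
-- Created on each monthly partition, so a retried Work() that re-inserts the
-- same (schedule_id, scheduled_time, attempt) fails fast instead of
-- double-counting the execution.
CREATE UNIQUE INDEX executions_2026_05_attempt_dedupe
  ON scheduler.executions_2026_05 (schedule_id, scheduled_time, attempt);
```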
Note: River has no equivalent of Asynq's RevokeTask (silent drop with no archive). river.JobCancel is the analog of SkipRetry (terminal, retained for forensics). One footgun fewer.
One Event Per Attempt vs. One Per Logical Execution
The dispatcher inserts an execution row on every attempt (with `attempt = job.Attempt`) and emits a Kafka event for every committed row. A 5xx-then-success produces two execution rows (one failed, `attempt=1`; one success, `attempt=2`) and two Kafka events. Consumers must filter:
| Consumer | Filter |
|---|---|
| Auto-pause | Look at most recent N rows for a schedule_id ORDER BY completed_at DESC. Don't react to a single mid-retry failure — the latest row's status determines counter reset. |
| Metering | Count only status='success' rows, or count (schedule_id, scheduled_time) groups (one per logical execution). |
Simpler than v2's "split into two topics" approach because the source of truth is the executions table — consumers can SQL-query it whenever they need ground truth, not just react to the Kafka stream.
Postgres Schema Highlights
Full DDL lives in the PRD. The shape worth knowing:
```sql
CREATE TABLE scheduler.schedules (
  schedule_id          TEXT PRIMARY KEY,
  tenant_id            TEXT NOT NULL,
  project_id           TEXT NOT NULL,
  owner_subject        TEXT NOT NULL,            -- 'user:alice'
  schedule_type        TEXT NOT NULL CHECK (schedule_type IN ('cron','once','recurring_interval')),
  cron_expression      TEXT,
  timezone             TEXT DEFAULT 'UTC',
  target               JSONB NOT NULL,           -- {url, method, kind, payload}
  state                TEXT NOT NULL DEFAULT 'active' CHECK (state IN ('active','paused','deleted')),
  auto_pause_threshold INTEGER NOT NULL DEFAULT 10
    CHECK (auto_pause_threshold = 0 OR auto_pause_threshold BETWEEN 3 AND 100),
  created_by_subject   TEXT NOT NULL,
  paused_by_subject    TEXT, paused_reason TEXT, paused_at TIMESTAMPTZ,
  -- (resumed_/deleted_/updated_ pairs as well)
  next_trigger_at      TIMESTAMPTZ,
  trigger_count        BIGINT NOT NULL DEFAULT 0,
  failure_count        BIGINT NOT NULL DEFAULT 0
);

CREATE TABLE scheduler.executions (
  execution_id   BIGSERIAL PRIMARY KEY,
  schedule_id    TEXT NOT NULL REFERENCES scheduler.schedules(schedule_id) ON DELETE CASCADE,
  tenant_id      TEXT NOT NULL,                  -- denormalized for query perf
  project_id     TEXT NOT NULL,
  scheduled_time TIMESTAMPTZ NOT NULL,
  started_at     TIMESTAMPTZ NOT NULL,
  completed_at   TIMESTAMPTZ NOT NULL,
  status         TEXT NOT NULL CHECK (status IN ('success','failed','malformed')),
  http_status    INTEGER,
  attempt        INTEGER NOT NULL,
  duration_ms    INTEGER NOT NULL,
  target_kind    TEXT NOT NULL,
  error          TEXT,
  river_job_id   BIGINT
) PARTITION BY RANGE (completed_at);
```
Notes:
- RLS enforced on both tables. SchedulerService sets `app.tenant_id` and `app.project_id` per-transaction via `set_config()`; cross-tenant ops use a dedicated `BYPASSRLS` role, never the application role. RLS is defense-in-depth on top of OpenFGA, not a replacement.
- `scheduler.executions` is range-partitioned monthly via `pg_partman` with 90-day default retention (configurable per-tenant later). Partition drop is metadata-only — constant-cost cleanup regardless of execution volume.
- Audit columns are denormalized — only the latest `paused_by_subject` is kept, not full pause/resume history. Engineering forensics rely on Loki structured logs; tenant-facing audit history will arrive when Retraced lands platform-wide (Phase 5 — ~1-2 days of `Retraced.Publish()` plumbing, no schema change).
Schedule Types Map to River
| Type | Fields used | River mapping |
|---|---|---|
| cron | cron_expression, timezone | Registered as a river.PeriodicJob at worker startup. Leader-elected periodic_job_enqueuer inserts a fresh river_job row per tick. |
| once | scheduled_at | Single river.Client.InsertTx(args, InsertOpts{ScheduledAt: scheduled_at}) at create time. |
| recurring_interval | interval_seconds | Worker computes next tick at end of dispatch and inserts a new River job in a separate transaction after the execution row commits. On worker crash mid-window, River retries the whole Work(); Restate's Idempotency-Key dedupes the duplicate dispatch and the next-tick insert is re-attempted. |
Cron Edits Land at Next Worker Deploy
UpdateJob to cron_expression takes effect when loadPeriodicJobsFromDB runs — i.e., next worker Deployment restart. No hot-reload in v1. Hot-reload (LISTEN/NOTIFY watcher + PeriodicJob re-registration) is ~3-5 days of work for a non-load-bearing UX feature; deferred until cron edit volume becomes a recurring support question.
The documented immediate-effect workaround — both in API docs and in the UpdateJob response itself — is PauseJob on the existing schedule + CreateJob for a new one with the desired cron. The UpdateJob response surfaces effective_at: "next_worker_deployment" so consuming UIs can communicate this clearly.
Multi-Tenancy & Quotas
| Quota | Default | Enforcement |
|---|---|---|
| max_schedules_per_project | 500 | CreateJob returns RESOURCE_EXHAUSTED. Counted via RLS-scoped query inside the SERIALIZABLE transaction to prevent race conditions. |
| max_schedules_per_user | 50 | Limits individual user's schedules within a project. |
| min_schedule_interval | 60s | Reject cron more frequent than 1/min — River poll cadence is 1s but sub-minute precision is not a stable guarantee. |
Project deletion runs in a single Postgres transaction:
```sql
-- 1. Cancel pending River jobs for this project
UPDATE public.river_job SET state = 'cancelled', finalized_at = now()
WHERE kind = 'scheduled.dispatch' AND state IN ('available','scheduled','retryable')
  AND (args->>'tenant_id') = $1 AND (args->>'project_id') = $2;

-- 2. Soft-delete schedule rows
UPDATE scheduler.schedules SET state = 'deleted', updated_at = now()
WHERE tenant_id = $1 AND project_id = $2 AND state != 'deleted';
```
Both steps share a transaction, so partial deletion is impossible.
Auto-Pause is Per-Schedule, Configurable
Different schedules in the same project legitimately want different thresholds (medication reminders vs. marketing digests). Stored on scheduler.schedules.auto_pause_threshold:
- `0` — disabled (manual pause only)
- `3..100` — valid range
- `10` — default
Per-tenant defaults are not built in v1. Add only if a tenant explicitly asks. The auto-pause Kafka consumer reads the threshold per-schedule and filters on (attempt, status) to avoid reacting to mid-retry failures — only the final failure of a fully-retried dispatch counts toward the consecutive-failure counter.
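The counter logic the consumer applies can be sketched in a few lines. Names are illustrative; the rows are assumed pre-filtered to final attempts only (mid-retry failures excluded), per the filtering rule above:

```go
package main

import "fmt"

// consecutiveFailures walks final-attempt statuses newest-first and counts
// failures until the first non-failure — the "latest row's status determines
// counter reset" rule.
func consecutiveFailures(newestFirstStatuses []string) int {
	n := 0
	for _, s := range newestFirstStatuses {
		if s != "failed" {
			break // any success resets the streak
		}
		n++
	}
	return n
}

// shouldAutoPause applies the per-schedule threshold; 0 disables auto-pause.
func shouldAutoPause(newestFirstStatuses []string, threshold int) bool {
	if threshold == 0 {
		return false
	}
	return consecutiveFailures(newestFirstStatuses) >= threshold
}

func main() {
	fmt.Println(shouldAutoPause([]string{"failed", "failed", "failed"}, 3))  // true
	fmt.Println(shouldAutoPause([]string{"failed", "success", "failed"}, 3)) // false — streak broken
	fmt.Println(shouldAutoPause([]string{"failed", "failed", "failed"}, 0))  // false — disabled
}
```

On a `true` result the consumer would call `PauseJob(reason="auto:consecutive_failures")`, which lands in `paused_reason` as described in the glossary.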
Security — Restate Ingress Is the Trust Boundary
In v1, all dispatch targets are Restate services reached via Restate ingress. The trust boundary is Restate ingress itself, not application-level signatures:
- Worker → Restate ingress is internal cluster traffic (port 8080, no KrakenD in this path).
- Restate ingress validates the request shape and routes to the registered target service.
- Restate ingress's native
Idempotency-Keydedup prevents replay-based attacks within the retention window.
No HMAC signing in v1. The PRD's earlier draft called for X-Schedule-Signature; it was dropped because no v1 target verifies it (all v1 targets rely on ingress trust) and operational cost of OpenBao key rotation was non-trivial for a 2-person team. HMAC re-enters as one option in target.auth.kind for external HTTP targets if Phase 6 is ever picked up — alongside bearer, oauth2_client_credentials, and mtls.
On-Behalf-Of Authorization
Scheduled actions use permission intersection: service:scheduler scope AND the owner_subject's current permissions. If the original user loses project access, their schedules fail at execution time — the worker's RLS-scoped target read returns no row and the dispatch is JobCancel'd.
Component Inventory
New / Extended Components
| Component | Purpose | PRD section |
|---|---|---|
| services/scheduler/v1/ | New Restate service. CRUD RPCs (CreateJob, GetJob, ListJobs, UpdateJob, PauseJob, ResumeJob, DeleteJob, DeleteJobsForProject, ListExecutions). | §SchedulerService API |
| Cloud SQL EP + regional HA + Managed Connection Pooling | Source of truth for everything. Provisioned by platform team. Phase 1 hard requirement. | §Phase 1 |
| scheduler.schedules migration | Schedule definitions + denormalized audit columns + auto_pause_threshold. | §Postgres Schema |
| scheduler.executions migration | Range-partitioned via pg_partman. | §Postgres Schema |
| RLS policies | tenant_id + project_id isolation; per-tx via SET LOCAL. | §Multi-Tenancy via Row-Level Security |
| River library + River's own migrations | Queue, cron, lease, rescuer. Migrations applied at deploy time. | §River Worker — HTTP Dispatcher |
| River Worker (HTTPDispatcher) | Standalone K8s Deployment, one fleet per target_kind. | §River Worker |
| CI deploy-time idempotency-retention check | Asserts Restate retention >= MaxAttempts × max-backoff before any worker rollout. | §Idempotency Retention as Load-Bearing Config |
| Audit-only reconciliation cron | Nightly sweeper, expected-zero. | §Reconciliation — Audit-Only |
| KrakenD /scheduler/* routes | Scheduler team adds directly (2-person team — no formal cross-team handoff). | §Phase 1 |
| Auto-pause Kafka consumer | Reads auto_pause_threshold per schedule; calls PauseJob(reason="auto:consecutive_failures"). | §Phase 4, §Auto-Pause |
| Metering Kafka consumer | Per-execution Lago events; dedups on (schedule_id, scheduled_time). | §Phase 4 |
Integrated Components (No Changes Required)
| Component | Role |
|---|---|
| Restate Go SDK | restate.Run() for journaling the create transaction; restate.WithIdempotencyKey is unused — the dispatcher sets the canonical IETF header directly. |
| Restate ingress | Native Idempotency-Key dedup on the partition processor. Handles cache hit on retries automatically. |
| pkg/tenancy | TenantFromContext / ProjectFromContext for setting RLS scope. |
| OpenFGA | Authorization for CRUD: member for read, admin for write/delete. RLS is defense-in-depth on top, not a replacement. |
| Loki | Structured-log substrate for engineering forensics; pre-Retraced audit evidence trail. |
Observability Additions
Metrics emitted from SchedulerService and the Worker. Labels include tenant_id + project_id so per-project dashboards work day one.
| Metric | Type | Description |
|---|---|---|
| scheduler.schedules.total | Gauge | Schedule count by state, target_kind, project |
| scheduler.dispatch.total | Counter | Dispatch invocations by target_kind, result |
| scheduler.dispatch.duration_ms | Histogram | Dispatch latency |
| scheduler.dispatch.retries | Counter | River retries by attempt number |
| scheduler.dispatch.idempotency_window_safety | Gauge | Runtime check against the retention invariant |
| scheduler.queue.available | Gauge | River pending jobs by queue (drives KEDA HPA) |
| scheduler.queue.scheduled | Gauge | River future-scheduled jobs by queue |
| scheduler.audit.findings | Counter | Reconciliation sweeper findings — expected zero |
| pg.connection_pool.in_use | Gauge | Cloud SQL pooler — connections in use |
| pg.replication.lag_seconds | Gauge | HA replica lag (alert if > 10s) |
Alerts at runtime: pending queue depth > 1000 for 5m (warn), dispatch error rate > 10% for 5m (page), reconciliation finding any (page), Cloud SQL HA replica lag > 10s for 1m (warn), connection pool exhaustion any (page), River rescuer requeues > 50/hr (investigate).
Rollout Phases
The 5-phase plan from the PRD, plus a deliberately-not-on-roadmap Phase 6 placeholder. Phase 1 timeline depends on platform-team capacity to provision Cloud SQL — confirm provisioning lead time before committing to the 2-week estimate.
| Phase | Scope | Estimate | Status |
|---|---|---|---|
| 1. SchedulerService + Cloud SQL + River | Cloud SQL EP + HA + Managed Pooling provisioned by platform team (smallest class, scale reactively); migrations for scheduler.schedules (with audit columns + auto_pause_threshold), scheduler.executions (pg_partman), RLS policies, River's own migrations; CRUD subset (CreateJob, GetJob, ListJobs, DeleteJob) with one-PG-tx + SERIALIZABLE; River Worker Deployment per target_kind, MaxAttempts=10, ~6h retry window, canonical Idempotency-Key header; CI deploy-time check (retention ≥ retry window); KrakenD /scheduler/* routes; River UI; audit-only reconciliation cron. No HMAC in Phase 1. | 2 weeks | Not started |
| 2. NotificationGateway target | HandleScheduledNotification RPC; configure Restate ingress retention ≥ retry window; Novu dispatch; end-to-end test with duplicate-Idempotency-Key dedup verification. No application-level dedup middleware required. | 1 week | Not started |
| 3. LLMGateway target | HandleScheduledMessage RPC; same idempotency story as Phase 2; message injection into conversation thread; end-to-end test. | 1 week | Not started |
| 4. Execution history + full CRUD | Auto-pause Kafka consumer (per-schedule auto_pause_threshold, default 10, range 0/3-100); metering Kafka consumer (Lago); ListExecutions (RLS-scoped queries); UpdateJob (with effective_at: "next_worker_deployment" response field), PauseJob, ResumeJob, DeleteJobsForProject. | 1 week | Not started |
| 5. Maintenance + Retraced | OTel + Grafana dashboard (Cloud SQL + River + dispatch, including the safety-margin metric); worker drain procedure; PG minor-version upgrade runbook (sub-second blip on EP+HA); PG major-version upgrade via DMS-managed cutover (~30s blackout, ~5y cadence); load test against realistic Cloud SQL sizing; Retraced integration (~1-2 days Retraced.Publish() plumbing in CRUD handlers) when the platform-wide rollout lands. | — | Not started — Retraced timing depends on platform |
Phase Dependencies
| Phase depends on | Reason |
|---|---|
| Phase 1 depends on platform team's Cloud SQL provisioning | Phase 1 cannot start without the instance. Confirm lead time before committing to the 2-week estimate. |
| Phase 1 depends on Projects PRD | RLS scope tenant_id + project_id is established by Projects. |
| Phase 1 depends on Edge Idempotency PRD | Restate ingress dedup is the v1 trust+correctness boundary for retries. |
| Phases 2 and 3 can run in parallel | Both consume Phase 1's substrate; share the same idempotency story; ordering between them is a team-capacity decision, not a dependency. |
| Phase 4 depends on Phases 2/3 | Auto-pause + metering need real targets to observe; ListExecutions is most useful when there are executions to list. |
| Phase 5 depends only on Phase 1 | Observability + runbooks + load test are about the substrate, not the tenants. Retraced lands separately based on platform availability. |
| Phase 6 has no dependency arrow into 1-5 | Phase 1 deliberately invests no scaffolding for it (see Phase 6 placeholder). |
Future: Phase 6 (External HTTP Targets) — Placeholder Only
Not committed. No customer ask, no dated roadmap entry. The section exists in the PRD so the design space is visible — i.e., if/when external webhook scheduling is requested, the extension shape is clear and the v1 architecture doesn't need to change.
If picked up, the additions are well-scoped:
- Per-target auth on the schedule doc — `target.auth.kind` ∈ `{none, hmac, bearer, oauth2_client_credentials, mtls}` with `secret_ref` to OpenBao.
- `target.headers` map for per-schedule custom headers; forbidden headers (`Idempotency-Key`, `X-Schedule-*`, `Host`) stripped or rejected at `CreateJob`.
- Per-tenant egress allowlist. `target.url` host must match a configured allowlist — without this, `CreateJob` is an SSRF / data-exfiltration primitive. SOC2 hard requirement.
- Idempotency becomes the target's problem. For non-Restate targets, `Idempotency-Key` is advisory — the worker still sends it, but the target may or may not honor it. At-least-once duplicates from worker crashes/retries are the contract.
- Per-target rate limiting. Trivial extension of queue partitioning — one queue per target host.
When/if this phase is picked up, expect roughly +50 lines of PRD, additive schema fields on target, and an admin-time allowlist policy. None of the cron / delivery / scaling decisions in this PRD change.
Out of Scope for v1
- Multi-region active-active scheduling. Single-region single-cluster is an explicit accepted constraint (River is Postgres-only by design — see riverqueue/river#104). Tenant sharding is the canonical horizontal-scale path beyond single-instance vertical scale.
- Hot-reload of cron schedule edits. `UpdateJob` to `cron_expression` takes effect at next worker Deployment restart. Documented Pause+Create-new workaround for immediate-effect needs.
- HMAC signing/verification on dispatched HTTP requests. Restate ingress is the v1 trust boundary; HMAC re-enters as one option in `target.auth.kind` if Phase 6 is picked up.
- Tenant-facing audit log UI / append-only audit-events table. Phase 1 is denormalized columns + Loki only. Retraced lands in Phase 5; full event-log auditing arrives with it. Pre-Retraced SOC2 gap is an accepted risk; if Retraced timeline slips materially, revisit.
- Per-tenant default `auto_pause_threshold`. Per-schedule only in v1.
- Per-tenant Cloud SQL instances. One scheduler-scoped instance for v1; tenant sharding only when a single primary saturates.
- Broader Firestore → Postgres migration. Cloud SQL is provisioned because River requires it. Other domains stay on Firestore unless they make a separate per-domain decision.
- External HTTP destinations. Phase 6 placeholder — see above.
- `_platform`-scoped schedules. No org-wide schedules in v1. Lands with the first `_platform` consumer's PRD.
Cross-References
- Scheduled Jobs PRD v3 — the v3 design in full, including DDL, CRUD flow, dispatcher, scaling levers, and open questions
- Edge Idempotency Roadmap — sibling roadmap; this design is the first active Client-Generated caller of the canonical `Idempotency-Key` header
- Tenant & Project Lifecycle Roadmap — multi-tenant substrate the RLS scope is built on
- GCP Eventarc → Kafka Pipeline Roadmap — sibling pattern for async GCP-originated events; explains why scheduler dispatches deliberately do not flow through Kafka
- RequestContext Refactor Roadmap — typed-subject and protovalidate guarantees `owner_subject` and audit columns inherit
- River library docs — particularly Periodic Jobs, Unique Jobs, Migrations
- Cloud SQL Enterprise Plus — near-zero-downtime maintenance tier used by this design
- Cloud SQL Managed Connection Pooling — Phase 1 hard requirement
- `pg_partman` — partition management for `executions` retention
- IETF draft-ietf-httpapi-idempotency-key-header-07 — the standard the dispatcher conforms to