Agentic AI & Platform Engineering

Legacy Workflow Platform → Hatchet & Agentic AI

Migrating a brittle, script-and-cron workflow stack to durable orchestration on Hatchet — then layering secured agentic AI for decision steps, with clear observability, retries, and replay from day one.

Lost Edges Security agentic-ai orchestration migration hatchet durability
Legacy Workflow Platform → Hatchet & Agentic AI
From fragile pipelines to durable workflows and guarded agents — one platform for queues, checkpoints, and production-grade operations.
Status: Active

Overview

This project is a platform migration: moving an organization off an aging workflow-based system (ad hoc scripts, brittle cron chains, and opaque handoffs) onto Hatchet — a developer platform built around durable tasks, workers, and composable workflows — and then evolving selected stages into secured, agentic AI flows where judgment adds value but must not bypass controls.

Hatchet is described in its documentation as supporting mission-critical AI agents, durable workflows, and background tasks, with tasks and workflows defined as code and durably persisted so teams can retry, replay, and debug long-running work instead of treating failures as one-off incidents. It can run on Hatchet Cloud or be self-hosted (MIT-licensed, with PostgreSQL as a core dependency for many deployments). That combination made it a strong fit when the goal was robustness, efficiency, and security without reinventing a queue-and-state engine in-house.

The Legacy Problem

The previous system had predictable failure modes:

  • Implicit state — progress lived in filesystem crumbs, email, or ad hoc database rows, so “what happened to job 418?” was often unanswerable.
  • Weak failure semantics — retries doubled work, skipped steps, or left downstream consumers half-updated.
  • No first-class parallelism model — scaling meant copying VMs or turning cron knobs, not fair scheduling with backpressure.
  • AI bolt-ons — experiments with LLMs lived outside the workflow graph, without versioning, evaluation, or audit parity with the rest of production.

None of that is unique; it is exactly why durable orchestration and disciplined agent design belong in the same conversation.

Why Hatchet

We evaluated Hatchet against “yet another queue plus homegrown orchestration.” The differentiator for this engagement was durability as a product primitive: work is recorded in a durable event log, which supports retries, replays, and more complex patterns (DAG-style workflows, sleeps, child spawning, and related features documented in the Hatchet user guide). For AI agents, the docs emphasize checkpointing so agents can resume after errors — aligned with how we wanted to run long-lived, multi-step automation without fragile in-memory state.

Engineering constraints also mattered: SDK support across the languages the client already uses, a path to self-host for data-residency requirements, and observability (logging, OpenTelemetry, metrics) so SRE practices carry over instead of starting from zero.

Target Architecture

At a high level:

  1. Workers (long-running processes in the customer’s environment) execute tasks — single units of work registered with Hatchet — with clear resource and concurrency limits.
  2. Durable workflows compose tasks into pipelines with dependencies, retries, and checkpoints — replacing the old “cron calls script A then maybe script B” graphs with versioned code and explicit triggers (API, events, schedules, or bulk runs, depending on the integration surface).
  3. Agentic stages replace only the steps that benefit from language understanding or fuzzy matching: classification, extraction, routing, or draft generation — always behind tool contracts, schema-validated outputs, and escalation paths for low-confidence results.

Security is treated as a cross-cutting layer: least-privilege credentials per worker pool, network segmentation between data planes, PII minimization in prompts and logs, and immutable audit metadata for agent decisions where compliance requires it.

Migration Process

We ran the cutover in five phases. Durations below are indicative for a mid-sized platform; your mileage depends on integration count, data gravity, and how clean the legacy semantics were.

Phase 1 — Discovery and semantics (2–3 weeks)

Inventory every workflow path: triggers, inputs, side effects, SLAs, and failure modes. For each path we answered: What is the minimum durable state we must record to safely retry? Legacy jobs that relied on “it ran at 2am” without idempotency were redesigned before any code moved.

Phase 2 — Hatchet foundations (3–5 weeks)

Stand up Hatchet (cloud or self-hosted per policy), define task contracts, worker pools, retry and timeout policies, and baseline observability (structured logs, traces, dashboards). Migrate one non-critical pipeline end-to-end to prove networking, secrets, and deploy patterns.

Phase 3 — Parallel run and parity (4–8 weeks)

Port remaining workflows in priority order. Legacy and Hatchet paths run in parallel with shadow or sampled comparison on outputs until statistical parity and error budgets are met. This is where most calendar time goes — not because Hatchet is slow, but because business logic archaeology is slow.

Phase 4 — Agentic layers (4–8 weeks, overlapping)

Introduce agent-backed tasks only where evaluation harnesses exist: golden sets, regression prompts, and red-team probes for injection and tool misuse. Roll out behind feature flags and human approval for high-risk branches.

Phase 5 — Cutover and hardening (2–4 weeks)

Traffic flips to Hatchet as source of truth, freeze windows for risky periods, bulk retry playbooks validated, and runbooks updated for on-call. Legacy cron is retired or reduced to a narrow compatibility shim with an explicit sunset date.

Benefits

AreaOutcome
ReliabilityDurable persistence and replay mean fewer “unknown failed” incidents; retries are safe by design when tasks are idempotent.
EfficiencyFair scheduling, concurrency controls, and worker slot management reduce over-provisioning and thundering herds compared to naive cron fan-out.
VelocityWorkflows as code are reviewable, testable, and versioned like the rest of the application stack.
SecurityCentralized execution with explicit credentials and observability beats scattered scripts with shared SSH keys.
AI qualityAgents sit inside the same durability and monitoring model — not as shadow IT beside it.

Indicative Timeline

For planning and stakeholder communication we use a single Gantt-style summary (weeks are calendar, not pure engineering):

  • Weeks 1–3: Discovery and danger-list legacy paths.
  • Weeks 4–8: Hatchet platform hardening + first production workflow.
  • Weeks 9–18: Parallel run, parity testing, and bulk of migrations.
  • Weeks 12–20 (overlap): Agentic task design, evals, and staged rollout.
  • Weeks 19–22: Cutover, decommission legacy orchestration, post-cutover tuning.

Parallelization and staff loading can compress the calendar; data migration and regulatory review are the usual extenders.

Security and Compliance Notes

Self-hosted Hatchet still sits in your threat model: protect PostgreSQL, broker credentials, and worker networks as you would any control plane. We document data residency, retention for durable logs, and who can trigger replays — replay is powerful for debugging but must be permissioned.

Current Status

Active — the core platform is live on Hatchet; additional workflows and agent-assisted stages continue to ship under the same durability and review bar. If you are facing a similar legacy orchestration debt, we are happy to map a shorter discovery phase against your real job graph and integration list.

  • Durable execution first. Every task and workflow step is persisted for retries, debugging, and replay — so partial failures stop being mystery meat in log files.
  • Agent steps behind explicit guardrails. LLM-driven decisions run inside bounded tools, structured outputs, and human-in-the-loop checkpoints where policy requires it.
  • Operational clarity. Workers, concurrency, and backpressure are modeled explicitly — teams see what ran, what is waiting, and what to replay without SSH archaeology.
← Back to Projects April 25, 2026