docs/dynamic-agent-spawning.md

Dynamic agent spawning

How Randal dispatches ephemeral subagents for complex work.

Dynamic Agent Spawning Architecture

Vision

The natural evolution of Randal's posse model: instead of pre-deploying N specialized agents, a single "front-door" gateway agent spawns specialists on demand via Railway API.

This document describes the later Railway-backed spawning phase. Phase 1 local runtime orchestration now establishes the transport-neutral child backend contract, root-owned child tracker registry, structured context packet flow, canonical memory scope inheritance, and root rollups. Railway spawning builds on those seams instead of introducing a separate orchestration model.

Current Model (Static Posse)

Gateway ──→ Agent A (always running)
         ──→ Agent B (always running)
         ──→ Agent C (always running)
  • All agents run 24/7 regardless of demand
  • ~$225/month for 10 agents on Railway
  • Fixed expertise topology — adding a new specialty requires manual deploy
  • Idle agents consume resources even when no work is available

Target Model (Dynamic Spawning)

Gateway ──→ Front-door Agent (always running)
              ├──→ Spawns Specialist A (on demand, auto-teardown)
              ├──→ Spawns Specialist B (on demand, auto-teardown)
              └──→ Uses existing idle peers (if available)
  • Only the front-door agent runs permanently
  • Specialists spin up when needed, tear down when idle
  • Cost scales with actual usage
  • Expertise topology is emergent — spawn what you need, when you need it

How It Builds on the Unified Invocation Layer

The routeTask() function in the mesh router (packages/mesh/src/router.ts) currently returns:

  • A RoutingDecision — route to an existing idle agent with sufficient expertise
  • null — no suitable agent available, run locally

With dynamic spawning, a third option emerges:

  • A SpawnDecision — no idle specialist available → create one from an archetype template

The POST /posse/job and POST /job endpoints already handle delegation transparently via root-owned child tracking. For Railway-backed workers, the routing logic can eventually choose the railway-worker child backend; if routeTask() returns a spawn decision, the gateway:

  1. Calls the Railway API to create a service from an archetype template
  2. Waits for the service to become healthy (poll /health)
  3. The new agent auto-registers in Meilisearch on startup
  4. Delegates the task through the same child backend / child tracker flow used by local workers

From the user's perspective (Discord, HTTP), nothing changes — the gateway stays root-job-centric and handles spawn/delegate/teardown behind the same child backend contract.

Railway Backend Integration

The orchestration core should only know about the transport-neutral child backend contract. Railway-specific concerns live behind a provider lifecycle interface:

interface RailwayWorkerLifecycle {
  spawn(...)
  checkHealth(...)
  teardown(...)
}

The railway-worker child backend plugs that lifecycle into the same root/child lineage, context packet, normalized event, and rollup surfaces already used by local-opencode.

Railway API Integration

Service Creation

// Railway GraphQL API
// POST https://backboard.railway.com/graphql/v2
mutation {
  serviceCreate(
    input: {
      projectId: $projectId
      name: "specialist-{archetype}-{uuid}"
      source: { image: "ghcr.io/drewbietron/randal:latest" }
    }
  ) { id }
}

The gateway needs a RAILWAY_API_TOKEN with project-level permissions and the RAILWAY_PROJECT_ID for the target project.

Environment Configuration

Each spawned agent gets environment variables from its archetype template plus inherited parent config:

VariableSourcePurpose
AGENT_NAMEArchetypeUnique agent identity
AGENT_ROLEArchetypeRole for mesh routing (e.g., product-engineering)
AGENT_EXPERTISEArchetypeComma-separated expertise tags
RANDAL_SKIP_MEILISEARCHfalseUses shared Meilisearch instance
MEILISEARCH_URLParentShared Meilisearch endpoint
MEILISEARCH_MASTER_KEYParentShared Meilisearch auth
OPENROUTER_API_KEYParentLLM API access
RANDAL_SPAWNED_BYParent nameTracks lineage for teardown
RANDAL_POSSE_NAMEParentJoins the same posse
RANDAL_AUTO_TEARDOWNtrueEnables idle self-teardown

Health Polling

After spawning a service, poll GET /health every 5 seconds for up to 3 minutes:

async function waitForHealth(endpoint: string, timeout = 180_000): Promise<boolean> {
  const start = Date.now();
  while (Date.now() - start < timeout) {
    try {
      const res = await fetch(`${endpoint}/health`);
      if (res.ok) return true;
    } catch {
      // Not ready yet
    }
    await sleep(5_000);
  }
  return false; // Spawn timeout
}

Railway typically takes 30–90 seconds for first deploy of a cached Docker image.

Domain Assignment

Railway auto-assigns a *.up.railway.app domain to each service. The spawned agent registers itself in Meilisearch on startup (existing behavior in packages/gateway/src/gateway.ts), so the front-door discovers it via the existing mesh registry — no manual domain configuration needed.

Service Template Registry

Agent archetypes are stored in a Meilisearch index agent-archetypes-{posse}:

{
  "id": "product-engineering",
  "name": "Product Engineering Specialist",
  "role": "product-engineering",
  "expertise": ["react", "typescript", "nodejs", "nextjs", "tailwind", "prisma"],
  "specialization": "Full-stack web development with modern frameworks",
  "dockerImage": "ghcr.io/drewbietron/randal:latest",
  "resources": {
    "memory": "2Gi",
    "cpu": "1"
  },
  "warmupTime": "60s",
  "model": "anthropic/claude-sonnet-4",
  "maxConcurrentJobs": 2
}

The routeTask() function would query this index when no existing agent matches, selecting the archetype with the highest expertise overlap for the task's domain.

Archetype Examples

ArchetypeRoleKey ExpertiseTypical Use
product-engineeringFull-stack devReact, TypeScript, Node.jsFeature implementation
security-complianceSecurity reviewOWASP, auth, encryptionSecurity audits
data-engineeringData pipelinesSQL, ETL, analyticsData tasks
infra-devopsInfrastructureDocker, Railway, CI/CDDeploy and infra
research-analystResearchWeb search, synthesisDeep research tasks

Idle Teardown

Spawned agents should not run indefinitely. The teardown lifecycle:

  1. Monitor: If an agent has 0 active jobs for N minutes (default: 15), mark as idle-candidate
  2. Grace period: Send a "preparing to teardown" notification to the parent gateway. Wait 2 minutes for new work to arrive
  3. Teardown sequence:
    • Deregister from mesh (remove from Meilisearch posse-registry-{posse} index)
    • Delete the Railway service via GraphQL API
    • Remove any Meilisearch records (memory, chat history) scoped to the agent
  4. Never tear down the front-door agent — check RANDAL_SPAWNED_BY existence

Teardown can be initiated by:

  • The spawned agent itself (self-monitoring idle time via RANDAL_AUTO_TEARDOWN)
  • The parent gateway (centralized idle monitoring)
  • Budget controls (monthly limit reached)

Cost Controls

interface SpawnCostConfig {
  /** Max spawned agents running simultaneously (default: 3) */
  maxConcurrentSpawns: number;

  /** Railway cost cap per hour in USD (default: 5) */
  maxHourlyBudget: number;

  /** Hard monthly spending limit in USD */
  maxMonthlyBudget: number;

  /** Always prefer existing idle agents over spawning new ones (default: true) */
  preferIdle: boolean;

  /** Minimum seconds between spawns of the same archetype (default: 300) */
  cooldownSeconds: number;

  /** Minutes before idle teardown (default: 15) */
  idleTimeoutMinutes: number;
}

Cost tracking:

  • Railway provides usage metrics via their API
  • The gateway tracks spawn/teardown events with timestamps
  • Estimated cost = (agent runtime hours) × (Railway per-hour rate for resource tier)
  • When maxMonthlyBudget is 80% consumed, switch to "prefer idle only" mode
  • When 100% consumed, reject all spawn requests and route to existing agents or local

Failure Modes

FailureDetectionRecovery
Spawn timeout (Railway API slow)Health poll exceeds 3 minDelete the service, fall back to local execution
Health check failure (agent won't start)Health returns non-200 after startupDelete the service, log error with Railway logs, fall back to local
Railway API rate limit429 responseQueue spawn requests, retry with exponential backoff (max 60s)
Budget exceededCost tracking exceeds limitsReject spawn, route to existing agents or execute locally
Orphaned agents (gateway crashes before teardown)Heartbeat monitor detects stale agents (no heartbeat for 10 min)Auto-teardown via a background sweep job
Recursive spawningAgent tries to spawn another agentGuard: spawned agents (RANDAL_SPAWNED_BY set) cannot call spawn API

Orphan Detection

A background cron job (every 5 minutes) on the front-door gateway:

  1. Query Meilisearch for all agents with RANDAL_SPAWNED_BY set
  2. Check each agent's last heartbeat timestamp
  3. If last heartbeat > 10 minutes ago and agent is not in the active trackers map:
    • Attempt health check — if responsive, just update records
    • If unresponsive, initiate teardown sequence

Migration Path

PhaseDescriptionStatus
TodayStatic posse with pre-deployed agentsCurrent
Phase 1–3Unified invocation layer with delegation trackingImplemented (feat/unified-posse-invocation)
Phase 4aAdd SpawnDecision to routeTask(), Railway API client, basic spawn/teardownFuture
Phase 4bArchetype registry, cost controls, idle monitoringFuture
Phase 4cSelf-optimizing: track archetype usage frequency, pre-warm popular onesFuture

Phase 4a Details

Minimal additions to the existing codebase:

  1. New file: packages/mesh/src/spawner.ts — Railway API client, spawn/teardown/health
  2. Modify: packages/mesh/src/router.tsrouteTask() returns RoutingDecision | SpawnDecision | null
  3. Modify: packages/gateway/src/channels/http.tsPOST /posse/job handles SpawnDecision (spawn → wait → delegate)
  4. New file: packages/gateway/src/archetype-registry.ts — Meilisearch-backed archetype CRUD
  5. Modify: packages/gateway/src/gateway.ts — orphan detection cron, cost tracking

Phase 4b Details

  • Meilisearch index for archetypes with CRUD endpoints
  • Cost dashboard endpoint (GET /posse/costs)
  • Idle monitoring loop integrated into gateway heartbeat
  • Config file support for spawn cost limits

Phase 4c Details

  • Track which archetypes are spawned most frequently
  • Pre-warm popular archetypes during business hours
  • Auto-suggest new archetypes based on task routing misses
  • A/B test archetype configurations for cost vs. performance

Security Considerations

  • API key inheritance: Spawned agents inherit API keys from the parent's Railway project environment variables. They share the same LLM and Meilisearch access.
  • Principle of least privilege: Consider per-agent scoped tokens for Meilisearch (read-only for memory search, write for own agent records only). Railway service tokens can be scoped to the spawned service.
  • Recursion guard: Spawned agents (those with RANDAL_SPAWNED_BY set) must not be able to spawn additional agents. The spawn API should check this and reject.
  • Network isolation: All agents in the same Railway project share a private network. External access is via Railway's public domain only.
  • Secret rotation: When parent rotates API keys, spawned agents need restart or re-deploy. Consider a shared secret store (Railway shared variables) rather than per-service env vars.
  • Audit trail: All spawn/teardown events should be logged to Meilisearch with timestamp, archetype, cost, duration, and initiating job ID.