Dynamic Agent Spawning Architecture

Vision

The natural evolution of Randal's posse model: instead of pre-deploying N specialized agents, a single "front-door" gateway agent spawns specialists on demand via Railway API.

This document describes the later Railway-backed spawning phase. Phase 1 local runtime orchestration now establishes the transport-neutral child backend contract, root-owned child tracker registry, structured context packet flow, canonical memory scope inheritance, and root rollups. Railway spawning builds on those seams instead of introducing a separate orchestration model.

Current Model (Static Posse)

Gateway ──→ Agent A (always running)
         ──→ Agent B (always running)
         ──→ Agent C (always running)

All agents run 24/7 regardless of demand
~$225/month for 10 agents on Railway
Fixed expertise topology — adding a new specialty requires manual deploy
Idle agents consume resources even when no work is available

Target Model (Dynamic Spawning)

Gateway ──→ Front-door Agent (always running)
              ├──→ Spawns Specialist A (on demand, auto-teardown)
              ├──→ Spawns Specialist B (on demand, auto-teardown)
              └──→ Uses existing idle peers (if available)

Only the front-door agent runs permanently
Specialists spin up when needed, tear down when idle
Cost scales with actual usage
Expertise topology is emergent — spawn what you need, when you need it

How It Builds on the Unified Invocation Layer

The routeTask() function in the mesh router (packages/mesh/src/router.ts) currently returns:

A RoutingDecision — route to an existing idle agent with sufficient expertise
null — no suitable agent available, run locally

With dynamic spawning, a third option emerges:

A SpawnDecision — no idle specialist available → create one from an archetype template

The POST /posse/job and POST /job endpoints already handle delegation transparently via root-owned child tracking. For Railway-backed workers, the routing logic can eventually choose the railway-worker child backend; if routeTask() returns a spawn decision, the gateway:

Calls the Railway API to create a service from an archetype template
Waits for the service to become healthy (poll /health)
The new agent auto-registers in Meilisearch on startup
Delegates the task through the same child backend / child tracker flow used by local workers

From the user's perspective (Discord, HTTP), nothing changes — the gateway stays root-job-centric and handles spawn/delegate/teardown behind the same child backend contract.

Railway Backend Integration

The orchestration core should only know about the transport-neutral child backend contract. Railway-specific concerns live behind a provider lifecycle interface:

interface RailwayWorkerLifecycle {
  spawn(...)
  checkHealth(...)
  teardown(...)
}

The railway-worker child backend plugs that lifecycle into the same root/child lineage, context packet, normalized event, and rollup surfaces already used by local-opencode.

Railway API Integration

Service Creation

// Railway GraphQL API
// POST https://backboard.railway.com/graphql/v2
mutation {
  serviceCreate(
    input: {
      projectId: $projectId
      name: "specialist-{archetype}-{uuid}"
      source: { image: "ghcr.io/drewbietron/randal:latest" }
    }
  ) { id }
}

The gateway needs a RAILWAY_API_TOKEN with project-level permissions and the RAILWAY_PROJECT_ID for the target project.

Environment Configuration

Each spawned agent gets environment variables from its archetype template plus inherited parent config:

Variable	Source	Purpose
`AGENT_NAME`	Archetype	Unique agent identity
`AGENT_ROLE`	Archetype	Role for mesh routing (e.g., `product-engineering`)
`AGENT_EXPERTISE`	Archetype	Comma-separated expertise tags
`RANDAL_SKIP_MEILISEARCH`	`false`	Uses shared Meilisearch instance
`MEILISEARCH_URL`	Parent	Shared Meilisearch endpoint
`MEILISEARCH_MASTER_KEY`	Parent	Shared Meilisearch auth
`OPENROUTER_API_KEY`	Parent	LLM API access
`RANDAL_SPAWNED_BY`	Parent name	Tracks lineage for teardown
`RANDAL_POSSE_NAME`	Parent	Joins the same posse
`RANDAL_AUTO_TEARDOWN`	`true`	Enables idle self-teardown

Health Polling

After spawning a service, poll GET /health every 5 seconds for up to 3 minutes:

async function waitForHealth(endpoint: string, timeout = 180_000): Promise<boolean> {
  const start = Date.now();
  while (Date.now() - start < timeout) {
    try {
      const res = await fetch(`${endpoint}/health`);
      if (res.ok) return true;
    } catch {
      // Not ready yet
    }
    await sleep(5_000);
  }
  return false; // Spawn timeout
}

Railway typically takes 30–90 seconds for first deploy of a cached Docker image.

Domain Assignment

Railway auto-assigns a *.up.railway.app domain to each service. The spawned agent registers itself in Meilisearch on startup (existing behavior in packages/gateway/src/gateway.ts), so the front-door discovers it via the existing mesh registry — no manual domain configuration needed.

Service Template Registry

Agent archetypes are stored in a Meilisearch index agent-archetypes-{posse}:

{
  "id": "product-engineering",
  "name": "Product Engineering Specialist",
  "role": "product-engineering",
  "expertise": ["react", "typescript", "nodejs", "nextjs", "tailwind", "prisma"],
  "specialization": "Full-stack web development with modern frameworks",
  "dockerImage": "ghcr.io/drewbietron/randal:latest",
  "resources": {
    "memory": "2Gi",
    "cpu": "1"
  },
  "warmupTime": "60s",
  "model": "anthropic/claude-sonnet-4",
  "maxConcurrentJobs": 2
}

The routeTask() function would query this index when no existing agent matches, selecting the archetype with the highest expertise overlap for the task's domain.

Archetype Examples

Archetype	Role	Key Expertise	Typical Use
`product-engineering`	Full-stack dev	React, TypeScript, Node.js	Feature implementation
`security-compliance`	Security review	OWASP, auth, encryption	Security audits
`data-engineering`	Data pipelines	SQL, ETL, analytics	Data tasks
`infra-devops`	Infrastructure	Docker, Railway, CI/CD	Deploy and infra
`research-analyst`	Research	Web search, synthesis	Deep research tasks

Idle Teardown

Spawned agents should not run indefinitely. The teardown lifecycle:

Monitor: If an agent has 0 active jobs for N minutes (default: 15), mark as idle-candidate
Grace period: Send a "preparing to teardown" notification to the parent gateway. Wait 2 minutes for new work to arrive
Teardown sequence:
- Deregister from mesh (remove from Meilisearch posse-registry-{posse} index)
- Delete the Railway service via GraphQL API
- Remove any Meilisearch records (memory, chat history) scoped to the agent
Never tear down the front-door agent — check RANDAL_SPAWNED_BY existence

Teardown can be initiated by:

The spawned agent itself (self-monitoring idle time via RANDAL_AUTO_TEARDOWN)
The parent gateway (centralized idle monitoring)
Budget controls (monthly limit reached)

Cost Controls

interface SpawnCostConfig {
  /** Max spawned agents running simultaneously (default: 3) */
  maxConcurrentSpawns: number;

  /** Railway cost cap per hour in USD (default: 5) */
  maxHourlyBudget: number;

  /** Hard monthly spending limit in USD */
  maxMonthlyBudget: number;

  /** Always prefer existing idle agents over spawning new ones (default: true) */
  preferIdle: boolean;

  /** Minimum seconds between spawns of the same archetype (default: 300) */
  cooldownSeconds: number;

  /** Minutes before idle teardown (default: 15) */
  idleTimeoutMinutes: number;
}

Cost tracking:

Railway provides usage metrics via their API
The gateway tracks spawn/teardown events with timestamps
Estimated cost = (agent runtime hours) × (Railway per-hour rate for resource tier)
When maxMonthlyBudget is 80% consumed, switch to "prefer idle only" mode
When 100% consumed, reject all spawn requests and route to existing agents or local

Failure Modes

Failure	Detection	Recovery
Spawn timeout (Railway API slow)	Health poll exceeds 3 min	Delete the service, fall back to local execution
Health check failure (agent won't start)	Health returns non-200 after startup	Delete the service, log error with Railway logs, fall back to local
Railway API rate limit	429 response	Queue spawn requests, retry with exponential backoff (max 60s)
Budget exceeded	Cost tracking exceeds limits	Reject spawn, route to existing agents or execute locally
Orphaned agents (gateway crashes before teardown)	Heartbeat monitor detects stale agents (no heartbeat for 10 min)	Auto-teardown via a background sweep job
Recursive spawning	Agent tries to spawn another agent	Guard: spawned agents (`RANDAL_SPAWNED_BY` set) cannot call spawn API

Orphan Detection

A background cron job (every 5 minutes) on the front-door gateway:

Query Meilisearch for all agents with RANDAL_SPAWNED_BY set
Check each agent's last heartbeat timestamp
If last heartbeat > 10 minutes ago and agent is not in the active trackers map:
- Attempt health check — if responsive, just update records
- If unresponsive, initiate teardown sequence

Migration Path

Phase	Description	Status
Today	Static posse with pre-deployed agents	Current
Phase 1–3	Unified invocation layer with delegation tracking	Implemented (`feat/unified-posse-invocation`)
Phase 4a	Add `SpawnDecision` to `routeTask()`, Railway API client, basic spawn/teardown	Future
Phase 4b	Archetype registry, cost controls, idle monitoring	Future
Phase 4c	Self-optimizing: track archetype usage frequency, pre-warm popular ones	Future

Phase 4a Details

Minimal additions to the existing codebase:

New file: packages/mesh/src/spawner.ts — Railway API client, spawn/teardown/health
Modify: packages/mesh/src/router.ts — routeTask() returns RoutingDecision | SpawnDecision | null
Modify: packages/gateway/src/channels/http.ts — POST /posse/job handles SpawnDecision (spawn → wait → delegate)
New file: packages/gateway/src/archetype-registry.ts — Meilisearch-backed archetype CRUD
Modify: packages/gateway/src/gateway.ts — orphan detection cron, cost tracking

Phase 4b Details

Meilisearch index for archetypes with CRUD endpoints
Cost dashboard endpoint (GET /posse/costs)
Idle monitoring loop integrated into gateway heartbeat
Config file support for spawn cost limits

Phase 4c Details

Track which archetypes are spawned most frequently
Pre-warm popular archetypes during business hours
Auto-suggest new archetypes based on task routing misses
A/B test archetype configurations for cost vs. performance

Security Considerations

API key inheritance: Spawned agents inherit API keys from the parent's Railway project environment variables. They share the same LLM and Meilisearch access.
Principle of least privilege: Consider per-agent scoped tokens for Meilisearch (read-only for memory search, write for own agent records only). Railway service tokens can be scoped to the spawned service.
Recursion guard: Spawned agents (those with RANDAL_SPAWNED_BY set) must not be able to spawn additional agents. The spawn API should check this and reject.
Network isolation: All agents in the same Railway project share a private network. External access is via Railway's public domain only.
Secret rotation: When parent rotates API keys, spawned agents need restart or re-deploy. Consider a shared secret store (Railway shared variables) rather than per-service env vars.
Audit trail: All spawn/teardown events should be logged to Meilisearch with timestamp, archetype, cost, duration, and initiating job ID.

Dynamic agent spawning