Mesh guide

Multi-agent coordination and shared memory patterns.

Multi-Instance Mesh Guide

The @randal/mesh package lets multiple Randal instances discover each other, share work, and route jobs to the best-suited instance. This turns a collection of single agents into a coordinated mesh.


Concept overview

A mesh is a peer-to-peer network of Randal instances. Each instance:

  • Registers itself with the mesh on startup.
  • Advertises its role, expertise profile, current load, and health.
  • Accepts delegated jobs from other instances.
  • Routes incoming jobs to the best peer when a better match exists.

There is no central controller. Instances exchange lightweight heartbeats over HTTP, while Postgres holds the durable state: the instance registry, heartbeat timestamps, routing inputs, reliability annotations, the delegation audit trail, and stale-detection state.

┌──────────────┐     heartbeat     ┌──────────────┐
│   Instance   │ ◄───────────────► │   Instance   │
│  platform-   │                   │  product-    │
│  infra       │                   │  engineering │
└──────┬───────┘                   └──────┬───────┘
       │          heartbeat               │
       └──────────────┬───────────────────┘
                      ▼
               ┌──────────────┐
               │   Instance   │
               │  security-   │
               │  compliance  │
               └──────────────┘

Instance registration and discovery

When mesh.enabled is true, the instance:

  1. Reads mesh.endpoint to determine its own reachable URL.
  2. Contacts known peers (listed in .env or discovered via DNS/mDNS).
  3. Exchanges a registration payload containing:
    • Instance name
    • Role and expertise profile
    • Gateway endpoint
    • Current load (active jobs / capacity)
    • Model availability

Bootstrap methods

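| Method       | Config                          | Description                               |
|--------------|---------------------------------|-------------------------------------------|
| Static peers | MESH_PEERS=url1,url2 env var    | Comma-separated list of peer gateway URLs |
| DNS SRV      | MESH_DNS_SRV=_randal._tcp.local | DNS service discovery                     |
| mDNS         | Automatic on local networks     | Zero-config LAN discovery                 |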

On startup, the instance sends POST /api/mesh/register to each known peer and begins periodic heartbeats.
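
The static-peer bootstrap described above can be sketched as follows. This is a minimal illustration, not the actual @randal/mesh code: the RegistrationPayload shape is an assumption pieced together from the fields listed earlier (the models field name in particular is hypothetical), and parsePeers and buildRegisterRequests are illustrative helper names.

```typescript
// Hypothetical payload shape, based on the registration fields listed above.
interface RegistrationPayload {
  name: string;
  role?: string;
  expertise?: string;
  endpoint: string; // this instance's reachable gateway URL (mesh.endpoint)
  load: number;     // active jobs / capacity
  models: string[]; // model availability (field name is an assumption)
}

interface RegisterRequest {
  url: string;
  method: "POST";
  body: string;
}

// Split a MESH_PEERS value ("url1,url2") into clean peer base URLs.
function parsePeers(raw: string | undefined): string[] {
  if (!raw) return [];
  return raw
    .split(",")
    .map((s) => s.trim().replace(/\/+$/, "")) // drop whitespace and trailing slashes
    .filter((s) => s.length > 0);
}

// Build one POST /api/mesh/register request per peer, skipping our own endpoint.
function buildRegisterRequests(
  peers: string[],
  self: RegistrationPayload,
): RegisterRequest[] {
  const selfUrl = self.endpoint.replace(/\/+$/, "");
  return peers
    .filter((peer) => peer !== selfUrl)
    .map((peer) => ({
      url: `${peer}/api/mesh/register`,
      method: "POST" as const,
      body: JSON.stringify(self),
    }));
}
```

Keeping request construction separate from the network call makes the bootstrap logic easy to test; each returned request can then be sent with whatever HTTP client the instance already uses.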

Postgres-backed mesh state includes instance identity, posse, role, expertise, capabilities, model availability, endpoint, health, active job count, metadata, and last_heartbeat_at. HTTP delegation remains the transport for MVP peer work; Postgres records the audit trail and routing explanation, but it is not a hidden permanent queue.


Agent profiles

Each instance declares an expertise profile that the mesh uses for intelligent task routing. The profile has three tiers:

Role (mesh.role): one of 10 predefined domain slugs, used for pre-filtering candidates and analytics categorization.

| Domain slug             | Description                        | Typical expertise areas                            |
|-------------------------|------------------------------------|----------------------------------------------------|
| product-engineering     | Full-stack development             | React, TypeScript, APIs, databases, architecture   |
| platform-infrastructure | DevOps and SRE                     | Docker, Kubernetes, CI/CD, Terraform, observability |
| security-compliance     | Application and infra security     | AppSec, OWASP, GDPR, SOC2, penetration testing     |
| data-intelligence       | Data engineering and analytics     | ETL, ML, BigQuery, Spark, dashboards, BI           |
| design-experience       | UX/UI and accessibility            | Figma, design systems, a11y, i18n, prototyping     |
| content-communications  | Technical writing and comms        | Docs, blog, release notes, marketing copy          |
| revenue-growth          | Sales and business development     | GTM, partnerships, pricing, conversion funnels     |
| customer-operations     | Support and success                | Zendesk, onboarding, SLAs, churn, NPS              |
| strategy-finance        | Product management and finance     | Roadmaps, OKRs, budgets, sprint planning           |
| legal-governance        | Legal and policy                   | Contracts, NDAs, IP, licensing, regulatory         |

mesh:
  role: product-engineering

Expertise (mesh.expertise): a natural language description of the agent's detailed skills. This text is embedded (vectorized) at startup and used for semantic matching at routing time.

Three formats are supported:

Inline string:

mesh:
  expertise: >
    Expert in React, TypeScript, and frontend architecture.
    Deep knowledge of Next.js SSR, design systems, and
    responsive UI patterns.

File reference:

mesh:
  expertise:
    file: ./profiles/frontend-eng.md

Combined (file + additional context):

mesh:
  expertise:
    file: ./profiles/frontend-eng.md
    additional: "Also experienced with the internal billing system and Stripe integration"

The file format follows the same pattern as identity.knowledge — point to a markdown file containing a detailed expertise description. At boot, the file is read, concatenated with any additional text, and the full text is embedded for semantic matching.
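
As a rough sketch of how the three formats could be normalized into one string before embedding: the ExpertiseConfig type and resolveExpertise helper below are hypothetical names, the file reader is injected so the example stays self-contained, and the blank-line separator between file text and additional text is an assumption (the guide only says the two are concatenated).

```typescript
// The three supported shapes of mesh.expertise: an inline string,
// a file reference, or a file reference plus additional text.
type ExpertiseConfig = string | { file: string; additional?: string };

// Resolve the config to the full text that would be embedded at boot.
// readFile is passed in so callers can use any file API.
function resolveExpertise(
  config: ExpertiseConfig | undefined,
  readFile: (path: string) => string,
): string | undefined {
  if (config === undefined) return undefined; // no profile declared
  if (typeof config === "string") return config.trim(); // inline string

  const parts = [readFile(config.file).trim()]; // file reference
  if (config.additional) parts.push(config.additional.trim()); // combined form
  return parts.join("\n\n"); // separator is an assumption
}
```

With a reader that returns "Frontend expert." for ./profiles/frontend-eng.md, the combined form resolves to that text followed by a blank line and the additional sentence.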


Routing algorithm

When a job arrives, the mesh router scores every available instance and picks the best one. The score is a weighted sum of four factors:

score = w_e × expertise_match
      + w_r × reliability_score
      + w_l × (1 - load_ratio)
      + w_m × model_match

Weights

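| Factor            | Key         | Default | Description                                                                     |
|-------------------|-------------|---------|---------------------------------------------------------------------------------|
| Expertise match   | expertise   | 0.4     | Semantic similarity between task and agent expertise profile (2-tier fallback)  |
| Reliability score | reliability | 0.3     | Historical success rate for this domain (from @randal/analytics)                |
| Load availability | load        | 0.2     | Complement of the current load ratio, 1 - load_ratio (0 = fully loaded, 1 = idle) |
| Model match       | modelMatch  | 0.1     | 1.0 if the instance has access to the requested model                           |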

Reliability annotations and delegation outcomes are stored in Postgres and scoped by tenant/project. Routing explanations should reference the selected inputs without storing raw credentials, bearer tokens, signed artifact URLs, or full private transcripts.

Configure weights in your config:

mesh:
  routingWeights:
    expertise: 0.4
    reliability: 0.3
    load: 0.2
    modelMatch: 0.1
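
To make the weighted sum concrete, here is a small sketch of the formula with the default weights. The interface and function names are illustrative assumptions, not the package's exported API; only the formula and the default values come from this guide.

```typescript
// Keys mirror mesh.routingWeights.
interface RoutingWeights {
  expertise: number;
  reliability: number;
  load: number;
  modelMatch: number;
}

// Per-candidate inputs, each expected to lie in [0, 1].
interface RoutingFactors {
  expertiseMatch: number;
  reliabilityScore: number;
  loadRatio: number; // 0 = idle, 1 = fully loaded
  modelMatch: number; // 1.0 if the requested model is available
}

const DEFAULT_WEIGHTS: RoutingWeights = {
  expertise: 0.4,
  reliability: 0.3,
  load: 0.2,
  modelMatch: 0.1,
};

// score = w_e * expertise_match + w_r * reliability_score
//       + w_l * (1 - load_ratio) + w_m * model_match
function routingScore(
  f: RoutingFactors,
  w: RoutingWeights = DEFAULT_WEIGHTS,
): number {
  return (
    w.expertise * f.expertiseMatch +
    w.reliability * f.reliabilityScore +
    w.load * (1 - f.loadRatio) +
    w.modelMatch * f.modelMatch
  );
}
```

For example, a peer with expertise match 0.9, reliability 0.8, load ratio 0.25, and the requested model available scores 0.36 + 0.24 + 0.15 + 0.10, which is approximately 0.85 under the defaults. Because the default weights sum to 1, the score stays within [0, 1] whenever each factor does.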

Postgres Operations

Use shared Postgres for a posse unless you intentionally isolate agents by database. Each agent still has its own identity/scope; shared memory, graph context, mesh registry, job timelines, reliability annotations, and delegation audit remain tenant/project-scoped rows.

Operational checklist:

  1. Run randal db migrate before starting the posse.
  2. Set DATABASE_URL or RANDAL_DATABASE_URL for every service.
  3. Set RANDAL_REQUIRE_MEMORY=true for hosted services so /health fails when Postgres is unavailable.
  4. Verify randal db status --write-check, /health, randal mesh status, and one delegated job.
  5. Treat stale peers as unavailable when heartbeat age exceeds the configured tolerance.
  6. Use randal db dump for whole-posse backups and randal db scoped-copy for one org/project move.

Future DB-backed queues must preserve idempotency keys, redaction, retention, tenant boundaries, routing explanation IDs, and HTTP delegation compatibility.

2-tier expertise scoring

The expertise match factor uses a cascading fallback strategy:

  1. Semantic (Tier 1): If both the task prompt and the agent's expertise profile have been embedded (requires OPENROUTER_API_KEY), the router computes cosine similarity between the two vectors. This is the most accurate tier — it understands that "fix the login flow" matches an agent with "authentication and session management" expertise, even though the words differ.

  2. Role match (Tier 2): If embeddings are unavailable, the router performs an exact match on mesh.role against the auto-detected task domain. Score: 1.0 for exact match, 0.2 for no match.
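
The two tiers can be sketched as a single function. The TaskInfo and PeerProfile shapes are assumptions for illustration, and clamping the cosine similarity to [0, 1] is also an assumption (cosine similarity can be negative, and the guide does not say how negatives are handled).

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must have the same non-zero length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: no direction to compare
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical shapes: embeddings are optional because the embedding
// service may be unavailable.
interface TaskInfo {
  domain: string; // auto-detected or caller-supplied domain slug
  embedding?: number[];
}

interface PeerProfile {
  role?: string; // mesh.role
  embedding?: number[]; // embedded mesh.expertise text
}

function expertiseMatch(task: TaskInfo, peer: PeerProfile): number {
  // Tier 1: semantic similarity when both sides have embeddings.
  if (task.embedding && peer.embedding) {
    const sim = cosineSimilarity(task.embedding, peer.embedding);
    return Math.min(1, Math.max(0, sim)); // clamp is an assumption
  }
  // Tier 2: exact role match against the task domain.
  return peer.role === task.domain ? 1.0 : 0.2;
}
```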

Routing decision flow

  1. Auto-detect domain: Classify the task's domain from keywords using the 10-domain taxonomy — or accept an explicit domain hint from the caller.
  2. Embed the task: If the embedding service is available, vectorize the task description (single API call, <500ms).
  3. Pre-filter candidates: If enough peers exist (>2), narrow to those whose role matches the detected domain. If no role matches, keep all candidates.
  4. Score all candidates: Compute the weighted sum for each remaining peer (including self).
  5. Route: If the top-scoring peer is self, execute locally. If remote, delegate via POST /api/mesh/delegate and stream results back.
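
Steps 3 to 5 can be sketched as below. The Peer and RouteDecision types are hypothetical, the scoring function is passed in so the sketch stays independent of how scores are computed, and two details are assumptions: the ">2" threshold is applied to the full candidate list (including self), and ties are resolved in favor of local execution.

```typescript
interface Peer {
  name: string;
  role?: string;
  isSelf: boolean;
}

interface RouteDecision {
  target: string;
  local: boolean; // true: execute locally; false: delegate over HTTP
  score: number;
}

// Step 3: narrow to role matches only when there are enough candidates
// and at least one of them matches the detected domain.
function preFilter(peers: Peer[], domain: string): Peer[] {
  if (peers.length <= 2) return peers;
  const matching = peers.filter((p) => p.role === domain);
  return matching.length > 0 ? matching : peers;
}

// Steps 4 and 5: score the remaining candidates and pick the best one.
function route(
  peers: Peer[],
  domain: string,
  scoreOf: (p: Peer) => number,
): RouteDecision | undefined {
  let best: RouteDecision | undefined;
  for (const p of preFilter(peers, domain)) {
    const score = scoreOf(p);
    const wins =
      best === undefined ||
      score > best.score ||
      (score === best.score && p.isSelf); // tie goes to self (assumption)
    if (wins) best = { target: p.name, local: p.isSelf, score };
  }
  return best; // undefined only when there are no candidates at all
}
```

A decision with local set to false corresponds to the POST /api/mesh/delegate path described in step 5.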

Health monitoring

Each instance sends heartbeats to all known peers at a configurable interval (default: 30 seconds). A heartbeat contains:

{
  "name": "eng-agent",
  "role": "product-engineering",
  "expertise": "React, TypeScript, frontend architecture...",
  "endpoint": "http://eng-agent:7600",
  "load": 0.35,
  "activeJobs": 2,
  "uptime": 86400,
  "version": "0.1"
}

An instance is marked unhealthy if it misses 3 consecutive heartbeats (~90 seconds by default). Unhealthy instances receive a routing score of 0 and are skipped during delegation.

When an unhealthy instance resumes heartbeats, it is automatically re-admitted to the mesh.
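
A minimal sketch of the missed-heartbeat rule, assuming timestamps in milliseconds and treating a peer as unhealthy once a full three intervals have elapsed since its last heartbeat. The function and constant names are illustrative, and the exact boundary behavior is an assumption.

```typescript
const HEARTBEAT_INTERVAL_MS = 30000; // default interval: 30 seconds
const MISSED_HEARTBEAT_LIMIT = 3; // unhealthy after 3 consecutive misses

// A peer is healthy while less than interval * limit has passed since its
// last heartbeat (about 90 seconds with the defaults).
function isHealthy(
  lastHeartbeatAtMs: number,
  nowMs: number,
  intervalMs: number = HEARTBEAT_INTERVAL_MS,
  missedLimit: number = MISSED_HEARTBEAT_LIMIT,
): boolean {
  return nowMs - lastHeartbeatAtMs < intervalMs * missedLimit;
}

// Unhealthy peers get a routing score of 0, per the rule above.
function effectiveScore(rawScore: number, healthy: boolean): number {
  return healthy ? rawScore : 0;
}
```

Because health is derived from the latest heartbeat timestamp, re-admission needs no special handling: as soon as a fresh heartbeat is recorded, isHealthy returns true again.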


Cross-instance job delegation

Delegation follows these rules:

  1. Depth limit: Delegated jobs carry a depth counter. An instance will not re-delegate a job that has already been delegated runner.maxDelegationDepth times (default: 2).
  2. Sticky sessions: Once a job is delegated to a peer, follow-up messages in the same conversation are routed to the same peer unless it becomes unhealthy.
  3. Fallback: If the chosen peer rejects or times out, the originating instance falls back to local execution.
  4. Streaming: Delegated jobs stream events back via SSE so the end user sees real-time progress.
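
The depth-limit and sticky-session rules can be sketched as follows. The DelegatedJob shape, the in-memory session map, and the helper names are assumptions for illustration; only the default depth of 2 and the rules themselves come from this guide.

```typescript
interface DelegatedJob {
  id: string;
  depth: number; // how many times this job has already been delegated
}

// Rule 1: refuse to re-delegate once the depth counter reaches the limit
// (runner.maxDelegationDepth, default 2).
function canDelegate(job: DelegatedJob, maxDepth: number = 2): boolean {
  return job.depth < maxDepth;
}

// Copy the job with its counter incremented before handing it to a peer.
function forDelegation(job: DelegatedJob): DelegatedJob {
  return { ...job, depth: job.depth + 1 };
}

// Rule 2: follow-up messages stay with the same peer while it is healthy.
// sessions maps a conversation ID to the peer that handled it.
function stickyTarget(
  sessions: Map<string, string>,
  conversationId: string,
  isPeerHealthy: (peer: string) => boolean,
): string | undefined {
  const peer = sessions.get(conversationId);
  if (peer === undefined) return undefined; // no pin yet: route normally
  if (isPeerHealthy(peer)) return peer;
  sessions.delete(conversationId); // drop the pin so the router can choose again
  return undefined;
}
```

With the default limit, a fresh job (depth 0) can be delegated, its delegated copy (depth 1) can be delegated once more, and the resulting job (depth 2) must run wherever it lands.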

API endpoints

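| Endpoint            | Method | Description                        |
|---------------------|--------|------------------------------------|
| /api/mesh/register  | POST   | Register this instance with a peer |
| /api/mesh/heartbeat | POST   | Send health heartbeat              |
| /api/mesh/delegate  | POST   | Delegate a job to this instance    |
| /api/mesh/status    | GET    | Return mesh topology and health    |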

CLI commands

randal mesh status

Display the current mesh topology:

$ randal mesh status

Mesh Status
──────────────────────────────────────────────────────────────────────────────────
Instance        Role                     Expertise                   Load   Health
──────────────────────────────────────────────────────────────────────────────────
local (self)    platform-infrastructure   K8s, Terraform, CI/CD...   0.15   healthy
eng-agent       product-engineering       React, TypeScript, APIs..  0.42   healthy
sec-agent       security-compliance       AppSec, OWASP, audits...   0.00   healthy
docs-agent      content-communications    Tech writing, guides...    0.78   degraded
──────────────────────────────────────────────────────────────────────────────────
Total instances: 4 │ Healthy: 3 │ Unhealthy: 1

randal mesh route

Preview which instance would handle a given prompt:

$ randal mesh route "Fix the Docker build"

Routing Analysis
───────────────────────────────────────────────────────────
Domain detected: platform-infrastructure

Instance        Expert  Rel    Load   Model  Score
───────────────────────────────────────────────────────────
local (self)    0.920  0.270  0.170  0.100  0.94
eng-agent       0.310  0.210  0.200  0.100  0.51
sec-agent       0.050  0.150  0.120  0.100  0.37
───────────────────────────────────────────────────────────
→ Routing to: local (self)

Configuration examples

Minimal mesh instance

No profile fields required — the instance participates in the mesh but receives a neutral expertise score (0.5) during routing.

name: worker-1
runner:
  workdir: ./workspace

mesh:
  enabled: true
  endpoint: http://localhost:7600

Infrastructure agent with expertise profile

name: infra-agent
runner:
  workdir: ./workspace
  defaultModel: anthropic/claude-sonnet-4

mesh:
  enabled: true
  role: platform-infrastructure
  expertise: >
    Kubernetes cluster management, Terraform IaC, GitHub Actions CI/CD,
    Docker containerization, Prometheus/Grafana observability stack,
    AWS EKS and GCP GKE administration.
  endpoint: http://infra-agent:7600
  routingWeights:
    expertise: 0.5
    reliability: 0.25
    load: 0.15
    modelMatch: 0.1

File-based expertise profile

name: frontend-agent
runner:
  workdir: ./workspace

mesh:
  enabled: true
  role: product-engineering
  expertise:
    file: ./profiles/frontend-eng.md
    additional: "Also experienced with the internal billing system"
  endpoint: http://frontend-agent:7600

Three-node mesh (docker-compose)

Each configs/*.yaml file should have mesh.role and mesh.expertise set for optimal routing. See the examples above for the config format.

# docker-compose.yml
services:
  frontend-agent:
    image: ghcr.io/hassion-studio/randal:latest
    environment:
      MESH_PEERS: http://backend-agent:7600,http://infra-agent:7600
    volumes:
      - ./configs/frontend.yaml:/app/randal.config.yaml

  backend-agent:
    image: ghcr.io/hassion-studio/randal:latest
    environment:
      MESH_PEERS: http://frontend-agent:7600,http://infra-agent:7600
    volumes:
      - ./configs/backend.yaml:/app/randal.config.yaml

  infra-agent:
    image: ghcr.io/hassion-studio/randal:latest
    environment:
      MESH_PEERS: http://frontend-agent:7600,http://backend-agent:7600
    volumes:
      - ./configs/infra.yaml:/app/randal.config.yaml

Tips

  • Start with 2 instances and add more as your workload grows.
  • Use randal mesh route 'your task' to preview how the router scores your peers and to verify routing before deploying.
  • Write detailed expertise descriptions: the more specific, the better the semantic routing. Include technologies, frameworks, and domain knowledge.
  • Combine with @randal/analytics for reliability-informed routing.
  • Monitor the /api/mesh/status endpoint from your infrastructure tooling.
  • Set MESH_PEERS via environment variables so the same config image works across environments.