Voice and video setup notes for beta deployments.

Voice & Video Guide

Randal's voice support is optional; a normal text-only setup does not need any of this. In this branch, PSTN voice with Twilio is the primary path and the first one to get working. Browser/admin voice is supported, but it is secondary.

When enabled, the @randal/voice package wires LiveKit rooms, Twilio SIP trunks, and STT/TTS providers into the runner loop so Randal can listen, think, and speak in real time.


Minimum Viable PSTN Voice On Railway

Use this checklist if you want inbound or outbound phone calls to reach Randal through Twilio, with each call running through LiveKit, Deepgram, and ElevenLabs.

  1. Create or choose a Railway service using randal.config.railway.yaml.
  2. Set the required GitHub Actions repository secrets so the Railway deploy workflow can upsert them into Railway.
  3. Configure Twilio in a dedicated subaccount.
  4. Point RANDAL_VOICE_PUBLIC_URL at the public HTTPS/WSS host that Twilio can reach for Randal's /voice/* routes.
  5. Deploy.
  6. Test inbound or outbound PSTN calling.

If you are only testing browser/admin voice, skip to Browser-only testing (secondary path) below.


What To Add Where

1. GitHub Actions repository secrets

If you use .github/workflows/railway-deploy.yml, these are the values that get copied into Railway. Your local .env does not get copied into Railway by that workflow.

Required for PSTN voice on Railway:

  • RAILWAY_TOKEN
  • RAILWAY_WORKSPACE_ID
  • One provider secret: OPENROUTER_API_KEY or ANTHROPIC_API_KEY or OPENAI_API_KEY
  • MEILI_MASTER_KEY
  • RANDAL_API_TOKEN
  • RANDAL_VOICE_PUBLIC_URL
  • LIVEKIT_URL
  • LIVEKIT_API_KEY
  • LIVEKIT_API_SECRET
  • DEEPGRAM_API_KEY
  • ELEVENLABS_API_KEY
  • ELEVENLABS_VOICE_ID (recommended)
  • TWILIO_ACCOUNT_SID
  • TWILIO_AUTH_TOKEN
  • TWILIO_PHONE_NUMBER
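
Before deploying, it can help to confirm that none of the required values above are empty. The following is a minimal sketch of such a preflight check, not part of Randal itself; the function name `missing_voice_secrets` is hypothetical, and the variable names are copied from the list above.

```python
# Required names from the list above. ELEVENLABS_VOICE_ID is only recommended,
# so it is deliberately left out of the hard requirements.
REQUIRED = [
    "RAILWAY_TOKEN", "RAILWAY_WORKSPACE_ID", "MEILI_MASTER_KEY",
    "RANDAL_API_TOKEN", "RANDAL_VOICE_PUBLIC_URL",
    "LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET",
    "DEEPGRAM_API_KEY", "ELEVENLABS_API_KEY",
    "TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN", "TWILIO_PHONE_NUMBER",
]

# Exactly one of these provider secrets needs to be present.
PROVIDER_KEYS = ["OPENROUTER_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY"]


def missing_voice_secrets(env):
    """Return the names that are unset or empty in the given mapping."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if not any(env.get(name) for name in PROVIDER_KEYS):
        missing.append(" or ".join(PROVIDER_KEYS))
    return missing
```

Calling `missing_voice_secrets(os.environ)` in a CI step and failing the job when the list is non-empty would catch a forgotten secret before Railway ever sees the deploy.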

Optional depending on your deployment:

  • GH_TOKEN
  • TAVILY_API_KEY
  • DISCORD_BOT_TOKEN
  • FAL_KEY

2. Railway config in repo

Keep the checked-in randal.config.railway.yaml as the service config. It already includes:

  • gateway.channels: [http, voice, discord]
  • the voice: block for LiveKit, Twilio, Deepgram, and ElevenLabs
  • credential allowlists/inheritance for the voice env vars

3. Local .env

Use local .env only for local testing. It is not the source of truth for the Railway deploy workflow.

4. Twilio account setup

Use a dedicated Twilio subaccount for this integration so voice numbers, billing, and webhook changes stay isolated from anything else in your main Twilio account.

The current code uses Twilio account credentials directly:

  • TWILIO_ACCOUNT_SID
  • TWILIO_AUTH_TOKEN
  • TWILIO_PHONE_NUMBER

It does not currently use Twilio API keys for the PSTN runtime path.


Before you start

Choose the parts you actually need:

| Use case | Required services/accounts |
| --- | --- |
| Browser voice in the dashboard or your own UI | LiveKit + one STT provider + one TTS provider |
| Outbound/inbound phone calls | LiveKit + one STT provider + one TTS provider + Twilio |
| Video meeting participation | Same as voice, plus the meeting platform's SIP/dial-in support |

Required accounts and services for the common PSTN path in this repo:

  1. LiveKit Cloud account or your own LiveKit server
  2. Deepgram account for STT
  3. ElevenLabs account for TTS
  4. Twilio account only if you want PSTN phone calls
  5. A public HTTPS/WSS URL that reaches the Randal gateway when voice traffic comes from outside your machine

Required environment variables:

LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=APIxxxxxxxx
LIVEKIT_API_SECRET=xxxxxxxxxxxxxxxxxxxxxxxx
DEEPGRAM_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
ELEVENLABS_API_KEY=sk_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
ELEVENLABS_VOICE_ID=pNInz6obpgDQGcFmaJgB   # optional, falls back to a default voice
RANDAL_VOICE_PUBLIC_URL=https://voice.example.com
TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   # phone calls only
TWILIO_AUTH_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx      # phone calls only
TWILIO_PHONE_NUMBER=+15551234567                        # phone calls only

RANDAL_VOICE_PUBLIC_URL must be the public base URL for the gateway voice routes. It is not the LiveKit URL. Twilio and remote browsers use this URL to reach Randal's own /voice/... endpoints.

For PSTN on Railway, RANDAL_VOICE_PUBLIC_URL should usually be one of:

  • the public Railway service domain, if Twilio can reach it directly and WebSocket traffic behaves correctly
  • a public reverse proxy or edge host that forwards to the Railway gateway
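
A common mistake is to paste the LiveKit URL or a local address into RANDAL_VOICE_PUBLIC_URL. The sketch below shows one way to sanity-check the value. It is an illustration only: `check_voice_public_url` is a hypothetical helper, and the rule that the scheme must be `https` is an assumption based on the example value shown earlier.

```python
from urllib.parse import urlparse


def check_voice_public_url(public_url, livekit_url):
    """Return a list of problems with RANDAL_VOICE_PUBLIC_URL (empty means it looks fine)."""
    problems = []
    parsed = urlparse(public_url)

    # Assumption: the base URL is given as https://, as in the example env block.
    if parsed.scheme != "https":
        problems.append("scheme should be https")

    if not parsed.hostname:
        problems.append("missing host")
    elif parsed.hostname in ("localhost", "127.0.0.1"):
        problems.append("host is local; Twilio cannot reach it")

    # The gateway URL and the LiveKit URL are different things.
    if parsed.hostname and parsed.hostname == urlparse(livekit_url).hostname:
        problems.append("points at the LiveKit host, not the Randal gateway")

    return problems
```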

Architecture overview

Caller ──► Twilio SIP ──► LiveKit Room ──► Randal Voice Engine
                                               │
                                    ┌──────────┼──────────┐
                                    ▼          ▼          ▼
                                   STT      Runner      TTS
                                (Deepgram)  (Ralph)  (ElevenLabs)

  1. Audio arrives via a LiveKit room (browser widget, SIP, or direct).
  2. The voice engine streams audio chunks to the STT provider.
  3. Transcribed text is fed into the runner as a normal message.
  4. The runner's response text is sent to the TTS provider.
  5. Synthesised audio is published back into the LiveKit room.
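
The five steps above amount to a simple pipeline per conversational turn. The sketch below models that flow with plain callables so the data hand-offs are visible. It is not the actual @randal/voice implementation; every name in it (`run_voice_turn`, `transcribe`, `run`, `synthesize`, `publish`) is hypothetical, and real providers stream audio rather than processing a turn in one call.

```python
from typing import Callable, Iterable


def run_voice_turn(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[Iterable[bytes]], str],
    run: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    publish: Callable[[bytes], None],
) -> str:
    """Process one caller turn and return the runner's reply text."""
    text = transcribe(audio_chunks)   # step 2: audio chunks go to the STT provider
    reply = run(text)                 # step 3: transcript is fed to the runner as a message
    audio = synthesize(reply)         # step 4: reply text goes to the TTS provider
    publish(audio)                    # step 5: synthesised audio is published to the room
    return reply
```

Keeping each stage behind a plain function boundary is what makes it possible to swap Deepgram for another STT provider, or ElevenLabs for another TTS provider, without touching the rest of the loop.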

What runs where:

  • randal serve runs the Randal gateway and the voice HTTP/WebSocket routes.
  • docker-compose.voice.yml starts local media infrastructure only: Redis, LiveKit server, and the LiveKit SIP bridge.
  • Twilio talks to the public gateway voice routes, not directly to your local randal serve process unless you expose it with a tunnel or reverse proxy.

LiveKit setup

  1. Create an account at livekit.io.
  2. Copy the WebSocket URL, API Key, and API Secret from the project dashboard.
  3. Add them to your .env:
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=APIxxxxxxxx
LIVEKIT_API_SECRET=xxxxxxxxxxxxxxxxxxxxxxxx

Self-hosted

Run LiveKit on your own infrastructure with Docker:

docker run --rm -p 7880:7880 -p 7881:7881 -p 7882:7882/udp \
  livekit/livekit-server --dev

For production, see the LiveKit deployment docs. In --dev mode, the LiveKit server uses the fixed placeholder credentials devkey / secret as API key/secret.

For the full phone/media development stack in this repo, run:

docker compose -f docker-compose.voice.yml up -d

This starts:

  • Redis
  • LiveKit server
  • LiveKit SIP bridge

Use docker/voice/livekit.yaml and docker/voice/sip.yaml as the local reference configs. This compose file does not start the Randal gateway; run randal serve separately.

PSTN/Twilio Testing (Primary Path)

Twilio account guidance

Recommended setup:

  1. Create a dedicated Twilio subaccount for Randal voice.
  2. Buy or move one phone number into that subaccount.
  3. Use the subaccount's TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN.
  4. Set TWILIO_PHONE_NUMBER to the E.164 number you want Randal to use.

This repo currently expects Twilio account credentials, not Twilio API keys.
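
Since TWILIO_PHONE_NUMBER has to be in E.164 form, a quick format check can catch numbers pasted with spaces or without the leading plus sign. This is a minimal sketch using the general E.164 shape (a plus sign, a non-zero leading digit, at most 15 digits in total); `is_e164` is a hypothetical helper, and it checks format only, not whether the number exists.

```python
import re

# "+" followed by 2 to 15 ASCII digits, the first of which is not zero.
_E164 = re.compile(r"\+[1-9][0-9]{1,14}")


def is_e164(number: str) -> bool:
    """Return True if the string is shaped like an E.164 phone number."""
    return _E164.fullmatch(number) is not None
```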

Twilio setup checklist

  1. Buy a phone number in the Twilio console.
  2. Configure the number or call flow so Twilio reaches your public Randal voice routes.
  3. Set:
TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TWILIO_AUTH_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TWILIO_PHONE_NUMBER=+15551234567
  4. Set RANDAL_VOICE_PUBLIC_URL to the public HTTPS/WSS host that Twilio can reach.
  5. Make sure these routes are publicly reachable:
     • POST /voice/twiml/inbound
     • POST /voice/twiml/outbound/:sessionId
     • POST /voice/twilio/status/:sessionId
     • POST /voice/twilio/stream-status/:sessionId
     • GET /voice/media-stream/:sessionId
  6. Deploy and place a test call.
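
To see what Twilio will actually be calling, it can help to expand the route list into full URLs for a given session. The sketch below does that from RANDAL_VOICE_PUBLIC_URL. The helper name `twilio_route_urls` is hypothetical, and treating the media-stream route as a `wss://` WebSocket endpoint is an assumption drawn from the "HTTPS/WSS" wording above.

```python
from urllib.parse import quote


def twilio_route_urls(public_url: str, session_id: str) -> dict:
    """Expand the public voice routes into full URLs for one session."""
    base = public_url.rstrip("/")
    sid = quote(session_id, safe="")
    # Assumption: the media stream is a WebSocket, so it uses wss:// on the same host.
    ws_base = "wss://" + base.split("://", 1)[-1]
    return {
        "inbound": f"{base}/voice/twiml/inbound",
        "outbound": f"{base}/voice/twiml/outbound/{sid}",
        "status": f"{base}/voice/twilio/status/{sid}",
        "stream_status": f"{base}/voice/twilio/stream-status/{sid}",
        "media_stream": f"{ws_base}/voice/media-stream/{sid}",
    }
```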

Current implemented route posture:

  • Public voice ingress: the Twilio routes above, protected by Twilio signature validation
  • Protected admin/browser routes: POST /api/voice/token, GET /voice/status, and the rest of the authenticated HTTP surface
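
For readers unfamiliar with Twilio signature validation, the sketch below shows the general scheme Twilio documents for form-encoded webhooks: concatenate the full request URL with the POST parameters sorted by name, compute an HMAC-SHA1 with the auth token, and compare the Base64 result against the `X-Twilio-Signature` header. This illustrates the idea only and is not Randal's actual code; the function names are hypothetical, and in production Twilio's own helper libraries are the safer choice.

```python
import base64
import hashlib
import hmac


def twilio_signature(auth_token: str, url: str, params: dict) -> str:
    """Compute the expected signature for a form-encoded Twilio webhook."""
    # URL first, then each parameter name immediately followed by its value,
    # with parameters ordered by name.
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_twilio_request(auth_token: str, url: str, params: dict, header_signature: str) -> bool:
    """Compare the expected signature with the header value in constant time."""
    expected = twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected.encode("utf-8"), header_signature.encode("utf-8"))
```

The URL fed into the signature has to match what Twilio called, which is one reason RANDAL_VOICE_PUBLIC_URL must match the externally visible host exactly when a proxy or tunnel sits in front of the gateway.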

Railway Deployment For PSTN Voice

For Railway, the easiest operator model is:

  1. Keep randal.config.railway.yaml in the repo.
  2. Set the GitHub Actions secrets listed above.
  3. Let .github/workflows/railway-deploy.yml upsert those secrets into Railway.
  4. Treat local .env as local-only.

Minimum Railway PSTN checklist:

  • randal.config.railway.yaml includes - type: voice and the voice: block
  • GitHub repository secrets include all LIVEKIT_*, DEEPGRAM_API_KEY, ELEVENLABS_*, TWILIO_*, and RANDAL_VOICE_PUBLIC_URL
  • RANDAL_VOICE_PUBLIC_URL is public and Twilio-reachable
  • Twilio uses the same public base URL for Randal's /voice/* routes
  • Railway hosts the gateway/runner; LiveKit and Twilio stay external services

Why The PSTN Stack Currently Uses Four Services

For the current PSTN path, each service has a separate job:

  • Twilio: phone numbers, PSTN ingress/egress, webhook delivery, and media stream handoff
  • LiveKit: real-time room/media coordination and session plumbing
  • Deepgram: speech-to-text for live caller audio
  • ElevenLabs: text-to-speech for spoken responses back into the call

That means this build keeps the current multi-provider runtime on purpose. A future provider-consolidation pass may simplify the stack, but that is a later cleanup project, not part of this integration build.


Scaling Notes For ~10 And ~100 Concurrent Calls

What the operator needs to know:

  • 10 concurrent calls is usually a configuration and quota check.
  • 100 concurrent calls is a systems-capacity exercise across Twilio, LiveKit, Deepgram, ElevenLabs, and Randal itself.

At around 10 concurrent calls, check:

  • Twilio account/subaccount limits, call routing, and webhook reliability
  • LiveKit room/media capacity for the expected codec and region
  • Deepgram concurrent streaming limits
  • ElevenLabs throughput and latency under overlapping TTS requests
  • Railway instance CPU/memory headroom for the gateway process

At around 100 concurrent calls, assume you will need to actively manage:

  • Twilio concurrency, phone number throughput, and status/webhook burst handling
  • LiveKit scaling, region placement, and media-node sizing
  • Deepgram stream concurrency and backpressure behavior
  • ElevenLabs synthesis throughput and the latency impact on turn-taking
  • Randal application scaling: more CPU, more memory, and likely multiple app instances for webhook/WebSocket load

Practical guidance:

  • Load-test with the same Twilio subaccount, LiveKit project, Deepgram account, and ElevenLabs plan you expect to use in production.
  • Watch end-to-end latency, not just gateway CPU.
  • Treat RANDAL_VOICE_PUBLIC_URL and Twilio webhook delivery as production dependencies, not just config values.
  • Expect 100 concurrent calls to require vendor-quota reviews and staged rollout, not just a bigger Railway instance.
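
One way to make the quota review concrete is to write down the concurrency limit you have confirmed with each vendor and compare it with the target. The sketch below does that arithmetic. The function name and the numbers in the usage note are placeholders, not real vendor limits.

```python
def capacity_report(target_calls: int, limits: dict):
    """Return (bottleneck service, headroom per service) for a target concurrency.

    `limits` maps a service name to the maximum concurrent calls or streams
    you have confirmed for your own account and plan.
    """
    headroom = {service: limit - target_calls for service, limit in limits.items()}
    bottleneck = min(limits, key=limits.get)
    return bottleneck, headroom
```

For example, with placeholder limits of `{"twilio": 150, "livekit": 120, "deepgram": 50, "elevenlabs": 80}` and a target of 100, the bottleneck is `deepgram` with a headroom of -50, meaning that quota has to be raised before the load test is meaningful.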

Local development flow

For a beginner-friendly local setup, do the steps in this order:

  1. Copy .env.example to .env and fill in the voice env vars you need.
  2. Start the media side:
docker compose -f docker-compose.voice.yml up -d
  3. Start the gateway separately:
randal serve
  4. Enable the voice channel and voice.enabled: true in your config.
  5. For browser voice on your own machine, you can usually test with local LiveKit plus the local dashboard.
  6. For Twilio webhooks or any remote client, expose the gateway with a public HTTPS tunnel and set RANDAL_VOICE_PUBLIC_URL to that public URL.

Example tunnel flow:

# Example with ngrok
ngrok http 7600

# Then set
RANDAL_VOICE_PUBLIC_URL=https://<your-ngrok-subdomain>.ngrok.app

STT provider setup

Deepgram (default)

  1. Sign up at deepgram.com and create an API key.
  2. Add to .env:
DEEPGRAM_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  3. Config:
voice:
  stt:
    provider: deepgram
    apiKey: ${DEEPGRAM_API_KEY}
    model: nova-2          # optional, defaults to provider's latest

OpenAI Whisper

voice:
  stt:
    provider: whisper
    apiKey: ${OPENAI_API_KEY}
    model: whisper-1

AssemblyAI

voice:
  stt:
    provider: assemblyai
    apiKey: ${ASSEMBLYAI_API_KEY}

TTS provider setup

ElevenLabs (default)

  1. Get an API key from elevenlabs.io.
  2. Choose a voice ID from the voice library.
voice:
  tts:
    provider: elevenlabs
    apiKey: ${ELEVENLABS_API_KEY}
    voice: pNInz6obpgDQGcFmaJgB    # "Adam" — or any voice ID

OpenAI TTS

voice:
  tts:
    provider: openai
    apiKey: ${OPENAI_API_KEY}
    voice: alloy

Cartesia

voice:
  tts:
    provider: cartesia
    apiKey: ${CARTESIA_API_KEY}
    voice: sonic-english

Edge TTS (free, no API key)

voice:
  tts:
    provider: edge
    voice: en-US-GuyNeural

Browser-only testing (secondary path)

Randal ships a lightweight voice widget that connects to a LiveKit room from the browser. This is useful for admin testing, but it is not the primary first deployment path in this branch.

To enable it:

  1. Make sure the voice channel is in your gateway config:
gateway:
  channels:
    - type: voice
  2. Make sure the voice block is enabled and has working LiveKit/STT/TTS credentials.
  3. Start randal serve.
  4. The dashboard (served by @randal/dashboard) automatically renders a microphone button when voice is enabled.
  5. Clicking the button requests a LiveKit participant token from the gateway, joins the room, and streams audio.

For custom UIs, use the LiveKit JavaScript SDK and request a token from POST /api/voice/token.
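
For context on what a LiveKit participant token is, the sketch below builds one by hand: a JWT signed with HS256 using the LiveKit API secret, with the API key as issuer, the participant identity as subject, and a `video` grant naming the room. This follows LiveKit's access-token format as commonly documented and is meant only to show the shape of the token. It is not how the Randal gateway is confirmed to mint tokens, `mint_livekit_token` is a hypothetical name, and a real service would normally use a LiveKit server SDK.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_livekit_token(api_key, api_secret, identity, room, ttl_seconds=600, now=None):
    """Build a short-lived HS256 JWT that grants joining a single room."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,                               # LIVEKIT_API_KEY
        "sub": identity,                              # participant identity
        "nbf": now,
        "exp": now + ttl_seconds,
        "video": {"room": room, "roomJoin": True},    # room-scoped join grant
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(api_secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)
```

Because the token is signed with LIVEKIT_API_SECRET, it has to be minted server-side; the browser only ever receives the finished token.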

Browser voice uses the same authenticated HTTP admin surface as the rest of the gateway. Anonymous browser clients do not get an implicit admin voice session, and if HTTP auth is not configured the protected browser voice routes fail closed.

Browser-only testing does not require Twilio. PSTN testing does.


Video call participation

Randal can join Zoom, Google Meet, and Microsoft Teams meetings via SIP or RTMP.

How it works

  1. SIP dial-in: Many conferencing platforms expose SIP URIs for meetings. Randal uses the LiveKit SIP bridge to dial into the meeting as a participant.
  2. Video processing: When video.enabled is true, Randal periodically captures frames from the video track and sends them to a vision model for scene understanding.
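
Periodic frame capture boils down to keeping at most one frame per interval and dropping the rest, which bounds how often the vision model is called. The sketch below shows that selection over a list of frame timestamps. The function name and the `interval_s` parameter are hypothetical; this guide does not document a config key for the capture interval.

```python
def sample_frame_times(frame_times, interval_s):
    """Pick frames so that consecutive picks are at least `interval_s` seconds apart.

    `frame_times` is an ascending list of frame timestamps in seconds.
    """
    selected = []
    next_due = None
    for t in frame_times:
        if next_due is None or t >= next_due:
            selected.append(t)
            next_due = t + interval_s
    return selected
```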

Configuration

voice:
  video:
    enabled: true
    visionModel: gpt-4o          # model for frame analysis
    publishScreen: false         # share Randal's screen into the call
    recordSessions: true         # save recordings locally
    recordPath: ./recordings

Meeting-specific notes

| Platform | Method | Notes |
| --- | --- | --- |
| Zoom | SIP URI | Requires Zoom SIP connector add-on |
| Google Meet | SIP dial-in | Available on Google Workspace Business+ |
| Microsoft Teams | SIP via Direct Routing | Requires Teams Phone System license |

Outbound calling

Randal can place outbound phone calls via Twilio:

randal call +15559876543 --prompt "Check in with the client about delivery"

Or programmatically through the gateway API:

curl -X POST http://localhost:7600/voice/call \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"to": "+15559876543", "reason": "Check in about delivery"}'

The call flow:

  1. Twilio places the outbound call.
  2. When answered, audio is bridged into a LiveKit room.
  3. The STT/Runner/TTS pipeline handles the conversation.
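
The curl request shown above can also be built from Python with only the standard library. This sketch mirrors the same endpoint, headers, and JSON body; `build_call_request` is a hypothetical helper, and the response format is not documented here, so the example stops at constructing the request.

```python
import json
import urllib.request


def build_call_request(base_url: str, token: str, to: str, reason: str) -> urllib.request.Request:
    """Build the POST /voice/call request without sending it."""
    body = json.dumps({"to": to, "reason": reason}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/voice/call",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen(...)` sends it; doing so against a live gateway places a real phone call, so test with a number you control.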

Turn detection

The voice engine detects when the caller stops speaking before generating a response. Two modes are available:

voice:
  turnDetection:
    mode: auto      # VAD-based automatic detection (default)
    # mode: manual  # wait for explicit push-to-talk signal
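
To give a feel for what VAD-based detection does in auto mode, the sketch below uses the simplest possible approach: treat a frame as speech when its energy crosses a threshold, and declare the turn finished after a run of consecutive silent frames following speech. This is a generic energy-threshold illustration, not the detector Randal actually uses; the function name, `threshold`, and `hangover_frames` are all hypothetical.

```python
def end_of_turn_index(frame_energies, threshold, hangover_frames):
    """Return the index of the frame where the turn is considered over, or None.

    A turn ends once speech has been heard and is then followed by
    `hangover_frames` consecutive frames below `threshold`.
    """
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            heard_speech = True
            silent_run = 0          # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return i
    return None
```

The hangover length is the main trade-off: a short run makes Randal respond faster but risks cutting in during a natural pause, while a long run adds latency to every turn.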

Full configuration example

name: voice-assistant
runner:
  workdir: ./workspace

voice:
  enabled: true

  livekit:
    url: ${LIVEKIT_URL}
    apiKey: ${LIVEKIT_API_KEY}
    apiSecret: ${LIVEKIT_API_SECRET}

  twilio:
    accountSid: ${TWILIO_ACCOUNT_SID}
    authToken: ${TWILIO_AUTH_TOKEN}
    phoneNumber: ${TWILIO_PHONE_NUMBER}

  stt:
    provider: deepgram
    apiKey: ${DEEPGRAM_API_KEY}
    model: nova-2

  tts:
    provider: elevenlabs
    apiKey: ${ELEVENLABS_API_KEY}
    voice: pNInz6obpgDQGcFmaJgB

  turnDetection:
    mode: auto

  video:
    enabled: false

gateway:
  channels:
    - type: voice
      access:
        trustedCallers:
          - ${ADMIN_CALLER_E164}
        unknownInbound: external
        defaultExternalGrants: [memory]
    - type: http
      port: 7600
      auth: ${API_TOKEN}

If you do not want voice, remove - type: voice and the entire voice: block. The rest of Randal works normally without any voice-specific credentials.
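
The configuration above relies on `${NAME}` placeholders being replaced with environment values. The sketch below shows one way such expansion can work and why failing loudly on an unset variable is useful. It is an illustration of the idea, not Randal's config loader; `expand_placeholders` is a hypothetical name, and the real loader may handle defaults or escaping differently.

```python
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(text: str, env: dict) -> str:
    """Replace each ${NAME} with env[NAME]; raise KeyError if any name is unset."""
    missing = []

    def replace(match):
        name = match.group(1)
        if name not in env:
            missing.append(name)
            return match.group(0)   # leave the placeholder in place for the error path
        return env[name]

    expanded = _PLACEHOLDER.sub(replace, text)
    if missing:
        raise KeyError("unset config variables: " + ", ".join(sorted(set(missing))))
    return expanded
```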

For higher-assurance deployments, keep browser/admin voice on the authenticated gateway surface and treat PSTN/Twilio routes as the only intentionally public voice ingress.


Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| No audio in room | LiveKit URL wrong or unreachable | Verify LIVEKIT_URL and network access |
| STT returns empty | API key invalid or rate-limited | Check provider dashboard for errors |
| High latency | STT + TTS round-trip too slow | Try deepgram STT + edge TTS for lowest latency |
| Outbound call fails | Twilio credentials or phone number misconfigured | Verify in Twilio console |
| Video frames not processed | video.enabled not set to true | Add video.enabled: true to config |