Every voice-enabled agent needs three components. The interesting part is how you compose them.
flowchart LR
MIC["Microphone /\naudio input"] --> STT["Speech-to-Text\n(STT / ASR)"]
STT -->|text| AGENT["Agent\n(LLM + tools)"]
AGENT -->|text| TTS["Text-to-Speech\n(TTS)"]
TTS --> SPK["Speaker /\naudio output"]
The agent itself is text-in, text-out. Voice is a transport layer. This is why the OpenAI-compatible endpoint approach works: the agent doesn't need to know it's doing voice.
Turning audio into text. The quality here determines whether the agent understands you.
| Model | Params | WER (LibriSpeech clean) | Languages | Notes |
|---|---|---|---|---|
| Whisper (OpenAI) | 39M–1.5B | ~2.7% (large-v3) | 99 | The default. Runs on CPU. base model is fast but noisy on accents. large-v3-turbo is the sweet spot. 10M+ downloads/month. |
| faster-whisper | same | same | 99 | CTranslate2 backend. 4x faster than OpenAI Whisper, lower memory. Drop-in replacement. |
| WhisperKit | same | same | 99 | CoreML-optimized for Apple Silicon. 10M downloads. Native Swift, runs on iPhone/Mac. |
| Parakeet (NVIDIA) | 600M | 1.69% | EN (v2), 25 EU (v3) | FastConformer-TDT. More accurate than Whisper, word timestamps, punctuation. Needs NeMo (CUDA). Free API at build.nvidia.com. |
| Nemotron-3.5 ASR (NVIDIA) | 600M | ~4% (streaming) | 36 | New (June 2026). Streaming ASR — transcribes in real-time, no silence detection needed. 17x more concurrent streams vs Parakeet RNNT 1.1B. ONNX + CoreML variants available. The real-time voice upgrade path. |
| Sherpa-ONNX | various | various | many | Lightweight ONNX runtime. Good for edge/mobile. No Python dependencies. |
| Service | Latency | Price | Notes |
|---|---|---|---|
| OpenAI Whisper API | ~1-3s | $0.006/min | Hosted Whisper. Simple, reliable. |
| Deepgram | ~300ms | $0.0043/min (Nova-3) | Streaming support. Very low latency. Popular for real-time voice agents. |
| Google Cloud STT | ~500ms | $0.006/min (Chirp 2) | 128 languages. Chirp 2 is competitive with Whisper large. |
| AssemblyAI | ~500ms | $0.0037/min (Universal-2) | Streaming, speaker diarization. Good for meetings/calls. |
| NVIDIA Riva (Parakeet) | ~200ms | Free tier | gRPC API. Best accuracy numbers. Free for development. |
SmolPaws today: Whisper base model, local, zero cost. Works but struggles with accents and short utterances. Upgrade path: faster-whisper large-v3-turbo locally, or Parakeet via NVIDIA API for best accuracy.
Turning agent text into natural-sounding speech. The quality gap between free and paid has narrowed dramatically.
| Model | Params | Quality | Speed | Notes |
|---|---|---|---|---|
| Kokoro | 82M | Excellent | Real-time on CPU | 14M downloads. Apache 2.0. Best quality-to-size ratio. Multiple voices. ONNX version available. |
| Higgs Audio v3 (Boson) | 4B | Excellent | Sub-second TTFA | New (June 2026). 102 languages, 21 emotions, singing/whispering/shouting. Zero-shot voice cloning. Speaks, not just reads. Research/non-commercial license. |
| dots.tts (RedNote) | 2B | High | Near real-time | New (June 2026). Fully continuous pipeline — no discrete codec tokens anywhere. AR + flow-matching over 48kHz AudioVAE. Trained on 1.5M hours. Apache 2.0. |
| Qwen3-TTS | 0.6B–1.7B | High | ~real-time | Custom voice cloning. 2M downloads. Multilingual. Apache 2.0. |
| F5-TTS | 335M | High | Near real-time | Flow-matching. Zero-shot voice cloning from 10s reference. CC-BY-NC. |
| Piper | small | Good | Very fast | VITS-based. 44 languages. Runs on Raspberry Pi. Used by Home Assistant. MIT. |
| edge-tts | n/a | Good | Fast | Free, uses Microsoft Edge's TTS. No API key needed. pip install edge-tts. |
macOS say | n/a | Decent | Instant | Built-in. Zero latency, zero cost. "Evan (Enhanced)" is the best English voice. |
| Service | Quality | Price | Notes |
|---|---|---|---|
| ElevenLabs | Best | ~$0.18/1K chars | Industry leader. 5K+ voices, 32 languages. Voice cloning. Also provides the full Conversational AI platform. |
| OpenAI TTS | Very good | $15/1M chars | Simple API. 6 voices. Good enough for most use cases. |
| Google Cloud TTS | Very good | $4–$16/1M chars | Journey voices are expressive. WaveNet for quality, Standard for cost. |
| MiniMax | Good | Low | Emerging. Used by Hermes Agent as a built-in provider. |
SmolPaws today: macOS say -v "Evan (Enhanced)" for local speech + WhatsApp voice notes via the outbox. Upgrade path: Kokoro for excellent local TTS, or ElevenLabs for managed hosting with voice cloning.
For reference, Hermes ships with 11 built-in TTS providers plus support for custom command-line providers:
edge, elevenlabs, openai, minimax, xai, mistral, gemini, neutts, kittentts, piper
+ any CLI command via config (e.g. "piper -m model.onnx -f {output_path}")
STT and TTS give you voice messages. Real-time conversation adds: turn-taking, interruption handling, latency management, and continuous audio streaming. This is the hard part.
flowchart TB
subgraph A ["Architecture 1: Cascaded (STT + LLM + TTS)"]
direction LR
A1["Audio in"] --> A2["STT"]
A2 -->|text| A3["LLM / Agent"]
A3 -->|text| A4["TTS"]
A4 --> A5["Audio out"]
end
subgraph B ["Architecture 2: Native multimodal"]
direction LR
B1["Audio in"] --> B2["Multimodal LLM\n(audio-in, audio-out)"]
B2 --> B3["Audio out"]
end
| Cascaded (STT + LLM + TTS) | Native multimodal | |
|---|---|---|
| Latency | Higher (3 serial steps) | Lower (single model) |
| Quality | Each component best-in-class | Improving fast |
| Flexibility | Mix and match components | Locked to one provider |
| Agent tools | LLM is text-based, tools work normally | Tool support varies |
| Examples | ElevenLabs Agents, Hermes Discord voice | OpenAI Realtime API, Gemini Live |
For AI agents with tools, the cascaded approach currently wins: the LLM layer stays text-based, so tool calling works identically to text conversations. Native multimodal is catching up but tool integration is less mature.
This is what Hermes does for Discord: join the voice channel, capture audio at the packet level, run the full pipeline locally.
flowchart LR
VC["Discord\nVoice Channel"] -->|RTP packets| VR["VoiceReceiver\n(Opus decode,\nsilence detect)"]
VR -->|PCM| W["Whisper\nSTT"]
W -->|text| AG["Agent"]
AG -->|text| TTS["TTS\n(ElevenLabs/\nOpenAI/edge)"]
TTS -->|PCM| MX["VoiceMixer\n(ambient bed +\nspeech ducking)"]
MX -->|Opus| VC
Components and what's generic vs platform-specific:
| Component | Platform-specific? | What it does |
|---|---|---|
| Voice transport | Yes | Connect to voice channel, receive/send audio packets. Discord uses WebSocket + RTP. Other platforms have their own protocols. |
| Codec decode (Opus) | No | Standard audio codec. Same everywhere. |
| Silence detection | No | Detect end of utterance by amplitude threshold + timer (1.5s in Hermes). |
| STT (Whisper/Parakeet) | No | Transcribe audio to text. Any STT engine works. |
| Agent (LLM + tools) | No | Process text, run tools, generate response. Completely decoupled from voice. |
| TTS | No | Generate speech audio from text. Any TTS engine works. |
| Audio mixer | No | Mix ambient + speech with ducking. Pure DSP (numpy). Works on any PCM stream. |
| Codec encode (Opus) | No | Encode mixed audio back. Standard. |
Only the transport layer is platform-specific. The rest is a generic voice engine that could be extracted and reused across platforms. Hermes hasn't done this extraction yet — it's all inside the Discord adapter — but architecturally, it's separable.
This is the ElevenLabs approach: the voice platform handles STT, TTS, turn-taking, and phone integration. The agent just needs to expose an OpenAI-compatible /v1/chat/completions endpoint.
flowchart LR
PH["Phone call /\nWeb widget /\nMobile app"] <-->|audio| EL["ElevenLabs\nConversational AI\n(STT + TTS +\nturn-taking)"]
EL <-->|"POST /v1/chat/completions"| AG["Agent Server\n(OpenAI-compatible\nendpoint)"]
AG --> RT["Agent Runtime\n(tools, memory,\nskills)"]
| Platform-native | Voice-as-a-service | |
|---|---|---|
| Agent code changes | Significant (voice pipeline) | One HTTP endpoint |
| Phone calls | Need SIP/Twilio integration | Built-in (Twilio, SIP) |
| Latency control | Full (local processing) | Depends on service |
| Cost | Free (local compute) | Per-minute pricing |
| Voice quality | Your choice of TTS | ElevenLabs quality (best) |
| Platforms unlocked | One at a time | All at once (phone, web, mobile) |
Key insight: The OpenAI-compatible endpoint is the convergence point. Both Hermes (PR #1756) and OpenHands (PR #3545) are building this. Once an agent exposes /v1/chat/completions, it's voice-ready without any voice code — ElevenLabs (or any voice frontend) handles the rest.
An important distinction that affects architecture choices.
| Voice messages | Real-time voice | |
|---|---|---|
| Examples | WhatsApp voice notes, Telegram voice messages, Slack audio clips | Discord voice channels, phone calls, ElevenLabs widget |
| Latency tolerance | Seconds to minutes (async) | Sub-second (conversational) |
| Implementation | Download audio file → STT → agent → TTS → send audio file | Continuous audio stream → turn detection → STT → agent → TTS → stream back |
| Turn-taking | Explicit (user records, sends) | Automatic (silence detection, interruption handling) |
| Complexity | Low (file-based) | High (streaming, real-time processing) |
Most messaging platforms only support voice messages. Discord is the notable exception with real-time voice channels. Phone calls via ElevenLabs/Twilio are real-time. The voice-as-a-service approach handles real-time complexity for you.
| Capability | Status | How |
|---|---|---|
| Receive voice (WhatsApp) | ✅ | Baileys downloads OGG → Whisper base transcribes |
| Send voice (WhatsApp) | ✅ | macOS say → ffmpeg → OGG Opus → voice outbox → baileys ptt:true |
| Local TTS | ✅ | macOS say -v "Evan (Enhanced)" |
| Local STT | ✅ | Whisper base via ~/.smolpaws/tools/whisper-env/ |
| Real-time voice | ❌ | Not yet. Planned via OpenAI-compatible endpoint + ElevenLabs. |
| Phone calls | ❌ | Not yet. Possible once /v1/chat/completions ships (PR #3545). |
| Discord voice | ❌ | Text only. Voice channel support scoped but not started. |
base with faster-whisper large-v3-turbo (local, free) or NVIDIA Parakeet (API, free tier). For real-time streaming: Nemotron-3.5 ASR (36 languages, ONNX/CoreML available).@discordjs/voice + Opus bindings. The generic voice pipeline (~400 lines) is well-understood from the Hermes study./v1/chat/completions on the agent-server