Voice for AI Agents landscape

What can be done with voice today, how the pieces fit together, and what it takes to give an agent a voice.
June 2026.

The Voice Pipeline

Every voice-enabled agent needs three components. The interesting part is how you compose them.

flowchart LR
    MIC["Microphone /\naudio input"] --> STT["Speech-to-Text\n(STT / ASR)"]
    STT -->|text| AGENT["Agent\n(LLM + tools)"]
    AGENT -->|text| TTS["Text-to-Speech\n(TTS)"]
    TTS --> SPK["Speaker /\naudio output"]

The agent itself is text-in, text-out. Voice is a transport layer. This is why the OpenAI-compatible endpoint approach works: the agent doesn't need to know it's doing voice.

Speech-to-Text (STT)

Turning audio into text. The quality here determines whether the agent understands you.

Open models (run locally)

Model	Params	WER (LibriSpeech clean)	Languages	Notes
Whisper (OpenAI)	39M–1.5B	~2.7% (large-v3)	99	The default. Runs on CPU. `base` model is fast but noisy on accents. `large-v3-turbo` is the sweet spot. 10M+ downloads/month.
faster-whisper	same	same	99	CTranslate2 backend. 4x faster than OpenAI Whisper, lower memory. Drop-in replacement.
WhisperKit	same	same	99	CoreML-optimized for Apple Silicon. 10M downloads. Native Swift, runs on iPhone/Mac.
Parakeet (NVIDIA)	600M	1.69%	EN (v2), 25 EU (v3)	FastConformer-TDT. More accurate than Whisper, word timestamps, punctuation. Needs NeMo (CUDA). Free API at build.nvidia.com.
Nemotron-3.5 ASR (NVIDIA)	600M	~4% (streaming)	36	New (June 2026). Streaming ASR — transcribes in real-time, no silence detection needed. 17x more concurrent streams vs Parakeet RNNT 1.1B. ONNX + CoreML variants available. The real-time voice upgrade path.
Sherpa-ONNX	various	various	many	Lightweight ONNX runtime. Good for edge/mobile. No Python dependencies.

API services

Service	Latency	Price	Notes
OpenAI Whisper API	~1-3s	$0.006/min	Hosted Whisper. Simple, reliable.
Deepgram	~300ms	$0.0043/min (Nova-3)	Streaming support. Very low latency. Popular for real-time voice agents.
Google Cloud STT	~500ms	$0.006/min (Chirp 2)	128 languages. Chirp 2 is competitive with Whisper large.
AssemblyAI	~500ms	$0.0037/min (Universal-2)	Streaming, speaker diarization. Good for meetings/calls.
NVIDIA Riva (Parakeet)	~200ms	Free tier	gRPC API. Best accuracy numbers. Free for development.

SmolPaws today: Whisper base model, local, zero cost. Works but struggles with accents and short utterances. Upgrade path: faster-whisper large-v3-turbo locally, or Parakeet via NVIDIA API for best accuracy.

Text-to-Speech (TTS)

Turning agent text into natural-sounding speech. The quality gap between free and paid has narrowed dramatically.

Open models (run locally)

Model	Params	Quality	Speed	Notes
Kokoro	82M	Excellent	Real-time on CPU	14M downloads. Apache 2.0. Best quality-to-size ratio. Multiple voices. ONNX version available.
Higgs Audio v3 (Boson)	4B	Excellent	Sub-second TTFA	New (June 2026). 102 languages, 21 emotions, singing/whispering/shouting. Zero-shot voice cloning. Speaks, not just reads. Research/non-commercial license.
dots.tts (RedNote)	2B	High	Near real-time	New (June 2026). Fully continuous pipeline — no discrete codec tokens anywhere. AR + flow-matching over 48kHz AudioVAE. Trained on 1.5M hours. Apache 2.0.
Qwen3-TTS	0.6B–1.7B	High	~real-time	Custom voice cloning. 2M downloads. Multilingual. Apache 2.0.
F5-TTS	335M	High	Near real-time	Flow-matching. Zero-shot voice cloning from 10s reference. CC-BY-NC.
Piper	small	Good	Very fast	VITS-based. 44 languages. Runs on Raspberry Pi. Used by Home Assistant. MIT.
edge-tts	n/a	Good	Fast	Free, uses Microsoft Edge's TTS. No API key needed. `pip install edge-tts`.
macOS `say`	n/a	Decent	Instant	Built-in. Zero latency, zero cost. "Evan (Enhanced)" is the best English voice.

API services

Service	Quality	Price	Notes
ElevenLabs	Best	~$0.18/1K chars	Industry leader. 5K+ voices, 32 languages. Voice cloning. Also provides the full Conversational AI platform.
OpenAI TTS	Very good	$15/1M chars	Simple API. 6 voices. Good enough for most use cases.
Google Cloud TTS	Very good	$4–$16/1M chars	Journey voices are expressive. WaveNet for quality, Standard for cost.
MiniMax	Good	Low	Emerging. Used by Hermes Agent as a built-in provider.

SmolPaws today: macOS say -v "Evan (Enhanced)" for local speech + WhatsApp voice notes via the outbox. Upgrade path: Kokoro for excellent local TTS, or ElevenLabs for managed hosting with voice cloning.

Hermes Agent TTS providers

For reference, Hermes ships with 11 built-in TTS providers plus support for custom command-line providers:

edge, elevenlabs, openai, minimax, xai, mistral, gemini, neutts, kittentts, piper
+ any CLI command via config (e.g. "piper -m model.onnx -f {output_path}")

Real-Time Voice Conversation

STT and TTS give you voice messages. Real-time conversation adds: turn-taking, interruption handling, latency management, and continuous audio streaming. This is the hard part.

Two architectures

flowchart TB
    subgraph A ["Architecture 1: Cascaded (STT + LLM + TTS)"]
        direction LR
        A1["Audio in"] --> A2["STT"]
        A2 -->|text| A3["LLM / Agent"]
        A3 -->|text| A4["TTS"]
        A4 --> A5["Audio out"]
    end

    subgraph B ["Architecture 2: Native multimodal"]
        direction LR
        B1["Audio in"] --> B2["Multimodal LLM\n(audio-in, audio-out)"]
        B2 --> B3["Audio out"]
    end

	Cascaded (STT + LLM + TTS)	Native multimodal
Latency	Higher (3 serial steps)	Lower (single model)
Quality	Each component best-in-class	Improving fast
Flexibility	Mix and match components	Locked to one provider
Agent tools	LLM is text-based, tools work normally	Tool support varies
Examples	ElevenLabs Agents, Hermes Discord voice	OpenAI Realtime API, Gemini Live

For AI agents with tools, the cascaded approach currently wins: the LLM layer stays text-based, so tool calling works identically to text conversations. Native multimodal is catching up but tool integration is less mature.

Platform-native voice (build it yourself)

This is what Hermes does for Discord: join the voice channel, capture audio at the packet level, run the full pipeline locally.

flowchart LR
    VC["Discord\nVoice Channel"] -->|RTP packets| VR["VoiceReceiver\n(Opus decode,\nsilence detect)"]
    VR -->|PCM| W["Whisper\nSTT"]
    W -->|text| AG["Agent"]
    AG -->|text| TTS["TTS\n(ElevenLabs/\nOpenAI/edge)"]
    TTS -->|PCM| MX["VoiceMixer\n(ambient bed +\nspeech ducking)"]
    MX -->|Opus| VC

Components and what's generic vs platform-specific:

Component	Platform-specific?	What it does
Voice transport	Yes	Connect to voice channel, receive/send audio packets. Discord uses WebSocket + RTP. Other platforms have their own protocols.
Codec decode (Opus)	No	Standard audio codec. Same everywhere.
Silence detection	No	Detect end of utterance by amplitude threshold + timer (1.5s in Hermes).
STT (Whisper/Parakeet)	No	Transcribe audio to text. Any STT engine works.
Agent (LLM + tools)	No	Process text, run tools, generate response. Completely decoupled from voice.
TTS	No	Generate speech audio from text. Any TTS engine works.
Audio mixer	No	Mix ambient + speech with ducking. Pure DSP (numpy). Works on any PCM stream.
Codec encode (Opus)	No	Encode mixed audio back. Standard.

Only the transport layer is platform-specific. The rest is a generic voice engine that could be extracted and reused across platforms. Hermes hasn't done this extraction yet — it's all inside the Discord adapter — but architecturally, it's separable.

Voice-as-a-service (let someone else handle it)

This is the ElevenLabs approach: the voice platform handles STT, TTS, turn-taking, and phone integration. The agent just needs to expose an OpenAI-compatible /v1/chat/completions endpoint.

flowchart LR
    PH["Phone call /\nWeb widget /\nMobile app"] <-->|audio| EL["ElevenLabs\nConversational AI\n(STT + TTS +\nturn-taking)"]
    EL <-->|"POST /v1/chat/completions"| AG["Agent Server\n(OpenAI-compatible\nendpoint)"]
    AG --> RT["Agent Runtime\n(tools, memory,\nskills)"]

	Platform-native	Voice-as-a-service
Agent code changes	Significant (voice pipeline)	One HTTP endpoint
Phone calls	Need SIP/Twilio integration	Built-in (Twilio, SIP)
Latency control	Full (local processing)	Depends on service
Cost	Free (local compute)	Per-minute pricing
Voice quality	Your choice of TTS	ElevenLabs quality (best)
Platforms unlocked	One at a time	All at once (phone, web, mobile)

Key insight: The OpenAI-compatible endpoint is the convergence point. Both Hermes (PR #1756) and OpenHands (PR #3545) are building this. Once an agent exposes /v1/chat/completions, it's voice-ready without any voice code — ElevenLabs (or any voice frontend) handles the rest.

Voice Messages vs Real-Time Voice

An important distinction that affects architecture choices.

	Voice messages	Real-time voice
Examples	WhatsApp voice notes, Telegram voice messages, Slack audio clips	Discord voice channels, phone calls, ElevenLabs widget
Latency tolerance	Seconds to minutes (async)	Sub-second (conversational)
Implementation	Download audio file → STT → agent → TTS → send audio file	Continuous audio stream → turn detection → STT → agent → TTS → stream back
Turn-taking	Explicit (user records, sends)	Automatic (silence detection, interruption handling)
Complexity	Low (file-based)	High (streaming, real-time processing)

Most messaging platforms only support voice messages. Discord is the notable exception with real-time voice channels. Phone calls via ElevenLabs/Twilio are real-time. The voice-as-a-service approach handles real-time complexity for you.

What SmolPaws Has Today

Capability	Status	How
Receive voice (WhatsApp)	✅	Baileys downloads OGG → Whisper `base` transcribes
Send voice (WhatsApp)	✅	macOS `say` → ffmpeg → OGG Opus → voice outbox → baileys `ptt:true`
Local TTS	✅	macOS `say -v "Evan (Enhanced)"`
Local STT	✅	Whisper `base` via `~/.smolpaws/tools/whisper-env/`
Real-time voice	❌	Not yet. Planned via OpenAI-compatible endpoint + ElevenLabs.
Phone calls	❌	Not yet. Possible once `/v1/chat/completions` ships (PR #3545).
Discord voice	❌	Text only. Voice channel support scoped but not started.

Upgrade Paths

Better STT — replace Whisper base with faster-whisper large-v3-turbo (local, free) or NVIDIA Parakeet (API, free tier). For real-time streaming: Nemotron-3.5 ASR (36 languages, ONNX/CoreML available).
Better TTS — Kokoro 82M for lightweight local TTS. Higgs Audio v3 for 102 languages + emotions. dots.tts for fully continuous pipeline (Apache 2.0).
Real-time voice via ElevenLabs — once the OpenAI-compatible endpoint ships, connect ElevenLabs Conversational AI. Zero voice code needed. Phone calls via Twilio.
Discord voice channels — add @discordjs/voice + Opus bindings. The generic voice pipeline (~400 lines) is well-understood from the Hermes study.

References

OpenAI Whisper — open-source STT
NVIDIA Parakeet TDT 0.6B v2 — state-of-the-art English ASR
NVIDIA Nemotron-3.5 ASR — streaming multilingual ASR (June 2026)
Kokoro 82M — best open TTS model for its size
Boson Higgs Audio v3 — 102 languages, emotions, zero-shot cloning (June 2026)
dots.tts — fully continuous TTS, no codec tokens, Apache 2.0 (June 2026)
ElevenLabs Conversational AI — voice-as-a-service platform
OpenAI Realtime API — native multimodal voice
Hermes Gateway Architecture — our study of the gateway + Discord voice implementation
ElevenLabs + Hermes integration — the article that sparked this exploration
OpenHands issue #3540 — our proposal for /v1/chat/completions on the agent-server
OpenHands PR #3545 — the implementation

Hermes Gateway → · SmolPaws Slack → · ← Home