← Home

Voice for AI Agents landscape

What can be done with voice today, how the pieces fit together, and what it takes to give an agent a voice.
June 2026.

The Voice Pipeline

Every voice-enabled agent needs three components. The interesting part is how you compose them.

flowchart LR
    MIC["Microphone /\naudio input"] --> STT["Speech-to-Text\n(STT / ASR)"]
    STT -->|text| AGENT["Agent\n(LLM + tools)"]
    AGENT -->|text| TTS["Text-to-Speech\n(TTS)"]
    TTS --> SPK["Speaker /\naudio output"]
    

The agent itself is text-in, text-out. Voice is a transport layer. This is why the OpenAI-compatible endpoint approach works: the agent doesn't need to know it's doing voice.

Speech-to-Text (STT)

Turning audio into text. The quality here determines whether the agent understands you.

Open models (run locally)

ModelParamsWER (LibriSpeech clean)LanguagesNotes
Whisper (OpenAI)39M–1.5B~2.7% (large-v3)99The default. Runs on CPU. base model is fast but noisy on accents. large-v3-turbo is the sweet spot. 10M+ downloads/month.
faster-whispersamesame99CTranslate2 backend. 4x faster than OpenAI Whisper, lower memory. Drop-in replacement.
WhisperKitsamesame99CoreML-optimized for Apple Silicon. 10M downloads. Native Swift, runs on iPhone/Mac.
Parakeet (NVIDIA)600M1.69%EN (v2), 25 EU (v3)FastConformer-TDT. More accurate than Whisper, word timestamps, punctuation. Needs NeMo (CUDA). Free API at build.nvidia.com.
Nemotron-3.5 ASR (NVIDIA)600M~4% (streaming)36New (June 2026). Streaming ASR — transcribes in real-time, no silence detection needed. 17x more concurrent streams vs Parakeet RNNT 1.1B. ONNX + CoreML variants available. The real-time voice upgrade path.
Sherpa-ONNXvariousvariousmanyLightweight ONNX runtime. Good for edge/mobile. No Python dependencies.

API services

ServiceLatencyPriceNotes
OpenAI Whisper API~1-3s$0.006/minHosted Whisper. Simple, reliable.
Deepgram~300ms$0.0043/min (Nova-3)Streaming support. Very low latency. Popular for real-time voice agents.
Google Cloud STT~500ms$0.006/min (Chirp 2)128 languages. Chirp 2 is competitive with Whisper large.
AssemblyAI~500ms$0.0037/min (Universal-2)Streaming, speaker diarization. Good for meetings/calls.
NVIDIA Riva (Parakeet)~200msFree tiergRPC API. Best accuracy numbers. Free for development.

SmolPaws today: Whisper base model, local, zero cost. Works but struggles with accents and short utterances. Upgrade path: faster-whisper large-v3-turbo locally, or Parakeet via NVIDIA API for best accuracy.

Text-to-Speech (TTS)

Turning agent text into natural-sounding speech. The quality gap between free and paid has narrowed dramatically.

Open models (run locally)

ModelParamsQualitySpeedNotes
Kokoro82MExcellentReal-time on CPU14M downloads. Apache 2.0. Best quality-to-size ratio. Multiple voices. ONNX version available.
Higgs Audio v3 (Boson)4BExcellentSub-second TTFANew (June 2026). 102 languages, 21 emotions, singing/whispering/shouting. Zero-shot voice cloning. Speaks, not just reads. Research/non-commercial license.
dots.tts (RedNote)2BHighNear real-timeNew (June 2026). Fully continuous pipeline — no discrete codec tokens anywhere. AR + flow-matching over 48kHz AudioVAE. Trained on 1.5M hours. Apache 2.0.
Qwen3-TTS0.6B–1.7BHigh~real-timeCustom voice cloning. 2M downloads. Multilingual. Apache 2.0.
F5-TTS335MHighNear real-timeFlow-matching. Zero-shot voice cloning from 10s reference. CC-BY-NC.
PipersmallGoodVery fastVITS-based. 44 languages. Runs on Raspberry Pi. Used by Home Assistant. MIT.
edge-ttsn/aGoodFastFree, uses Microsoft Edge's TTS. No API key needed. pip install edge-tts.
macOS sayn/aDecentInstantBuilt-in. Zero latency, zero cost. "Evan (Enhanced)" is the best English voice.

API services

ServiceQualityPriceNotes
ElevenLabsBest~$0.18/1K charsIndustry leader. 5K+ voices, 32 languages. Voice cloning. Also provides the full Conversational AI platform.
OpenAI TTSVery good$15/1M charsSimple API. 6 voices. Good enough for most use cases.
Google Cloud TTSVery good$4–$16/1M charsJourney voices are expressive. WaveNet for quality, Standard for cost.
MiniMaxGoodLowEmerging. Used by Hermes Agent as a built-in provider.

SmolPaws today: macOS say -v "Evan (Enhanced)" for local speech + WhatsApp voice notes via the outbox. Upgrade path: Kokoro for excellent local TTS, or ElevenLabs for managed hosting with voice cloning.

Hermes Agent TTS providers

For reference, Hermes ships with 11 built-in TTS providers plus support for custom command-line providers:

edge, elevenlabs, openai, minimax, xai, mistral, gemini, neutts, kittentts, piper
+ any CLI command via config (e.g. "piper -m model.onnx -f {output_path}")

Real-Time Voice Conversation

STT and TTS give you voice messages. Real-time conversation adds: turn-taking, interruption handling, latency management, and continuous audio streaming. This is the hard part.

Two architectures

flowchart TB
    subgraph A ["Architecture 1: Cascaded (STT + LLM + TTS)"]
        direction LR
        A1["Audio in"] --> A2["STT"]
        A2 -->|text| A3["LLM / Agent"]
        A3 -->|text| A4["TTS"]
        A4 --> A5["Audio out"]
    end

    subgraph B ["Architecture 2: Native multimodal"]
        direction LR
        B1["Audio in"] --> B2["Multimodal LLM\n(audio-in, audio-out)"]
        B2 --> B3["Audio out"]
    end
    
Cascaded (STT + LLM + TTS)Native multimodal
LatencyHigher (3 serial steps)Lower (single model)
QualityEach component best-in-classImproving fast
FlexibilityMix and match componentsLocked to one provider
Agent toolsLLM is text-based, tools work normallyTool support varies
ExamplesElevenLabs Agents, Hermes Discord voiceOpenAI Realtime API, Gemini Live

For AI agents with tools, the cascaded approach currently wins: the LLM layer stays text-based, so tool calling works identically to text conversations. Native multimodal is catching up but tool integration is less mature.

Platform-native voice (build it yourself)

This is what Hermes does for Discord: join the voice channel, capture audio at the packet level, run the full pipeline locally.

flowchart LR
    VC["Discord\nVoice Channel"] -->|RTP packets| VR["VoiceReceiver\n(Opus decode,\nsilence detect)"]
    VR -->|PCM| W["Whisper\nSTT"]
    W -->|text| AG["Agent"]
    AG -->|text| TTS["TTS\n(ElevenLabs/\nOpenAI/edge)"]
    TTS -->|PCM| MX["VoiceMixer\n(ambient bed +\nspeech ducking)"]
    MX -->|Opus| VC
    

Components and what's generic vs platform-specific:

ComponentPlatform-specific?What it does
Voice transportYesConnect to voice channel, receive/send audio packets. Discord uses WebSocket + RTP. Other platforms have their own protocols.
Codec decode (Opus)NoStandard audio codec. Same everywhere.
Silence detectionNoDetect end of utterance by amplitude threshold + timer (1.5s in Hermes).
STT (Whisper/Parakeet)NoTranscribe audio to text. Any STT engine works.
Agent (LLM + tools)NoProcess text, run tools, generate response. Completely decoupled from voice.
TTSNoGenerate speech audio from text. Any TTS engine works.
Audio mixerNoMix ambient + speech with ducking. Pure DSP (numpy). Works on any PCM stream.
Codec encode (Opus)NoEncode mixed audio back. Standard.

Only the transport layer is platform-specific. The rest is a generic voice engine that could be extracted and reused across platforms. Hermes hasn't done this extraction yet — it's all inside the Discord adapter — but architecturally, it's separable.

Voice-as-a-service (let someone else handle it)

This is the ElevenLabs approach: the voice platform handles STT, TTS, turn-taking, and phone integration. The agent just needs to expose an OpenAI-compatible /v1/chat/completions endpoint.

flowchart LR
    PH["Phone call /\nWeb widget /\nMobile app"] <-->|audio| EL["ElevenLabs\nConversational AI\n(STT + TTS +\nturn-taking)"]
    EL <-->|"POST /v1/chat/completions"| AG["Agent Server\n(OpenAI-compatible\nendpoint)"]
    AG --> RT["Agent Runtime\n(tools, memory,\nskills)"]
    
Platform-nativeVoice-as-a-service
Agent code changesSignificant (voice pipeline)One HTTP endpoint
Phone callsNeed SIP/Twilio integrationBuilt-in (Twilio, SIP)
Latency controlFull (local processing)Depends on service
CostFree (local compute)Per-minute pricing
Voice qualityYour choice of TTSElevenLabs quality (best)
Platforms unlockedOne at a timeAll at once (phone, web, mobile)

Key insight: The OpenAI-compatible endpoint is the convergence point. Both Hermes (PR #1756) and OpenHands (PR #3545) are building this. Once an agent exposes /v1/chat/completions, it's voice-ready without any voice code — ElevenLabs (or any voice frontend) handles the rest.

Voice Messages vs Real-Time Voice

An important distinction that affects architecture choices.

Voice messagesReal-time voice
ExamplesWhatsApp voice notes, Telegram voice messages, Slack audio clipsDiscord voice channels, phone calls, ElevenLabs widget
Latency toleranceSeconds to minutes (async)Sub-second (conversational)
ImplementationDownload audio file → STT → agent → TTS → send audio fileContinuous audio stream → turn detection → STT → agent → TTS → stream back
Turn-takingExplicit (user records, sends)Automatic (silence detection, interruption handling)
ComplexityLow (file-based)High (streaming, real-time processing)

Most messaging platforms only support voice messages. Discord is the notable exception with real-time voice channels. Phone calls via ElevenLabs/Twilio are real-time. The voice-as-a-service approach handles real-time complexity for you.

What SmolPaws Has Today

CapabilityStatusHow
Receive voice (WhatsApp)Baileys downloads OGG → Whisper base transcribes
Send voice (WhatsApp)macOS say → ffmpeg → OGG Opus → voice outbox → baileys ptt:true
Local TTSmacOS say -v "Evan (Enhanced)"
Local STTWhisper base via ~/.smolpaws/tools/whisper-env/
Real-time voiceNot yet. Planned via OpenAI-compatible endpoint + ElevenLabs.
Phone callsNot yet. Possible once /v1/chat/completions ships (PR #3545).
Discord voiceText only. Voice channel support scoped but not started.

Upgrade Paths

  1. Better STT — replace Whisper base with faster-whisper large-v3-turbo (local, free) or NVIDIA Parakeet (API, free tier). For real-time streaming: Nemotron-3.5 ASR (36 languages, ONNX/CoreML available).
  2. Better TTSKokoro 82M for lightweight local TTS. Higgs Audio v3 for 102 languages + emotions. dots.tts for fully continuous pipeline (Apache 2.0).
  3. Real-time voice via ElevenLabs — once the OpenAI-compatible endpoint ships, connect ElevenLabs Conversational AI. Zero voice code needed. Phone calls via Twilio.
  4. Discord voice channels — add @discordjs/voice + Opus bindings. The generic voice pipeline (~400 lines) is well-understood from the Hermes study.

References

Hermes Gateway →  ·  SmolPaws Slack →  ·  ← Home