A conversational bot with unmute

High-level diagram

Overall connection diagram:

- 👤 User -> 🌐 Browser -> Next.js Frontend (port 3000)
- 🌐 Browser <-> ⚙️ Main Backend (FastAPI, port 8000): HTTP GET /v1/health and WebSocket /v1/realtime
- ⚙️ Backend <-> 🎤 STT Service (port 8090): WebSocket
- ⚙️ Backend <-> 🔊 TTS Service (port 8089): WebSocket
- ⚙️ Backend <-> 🧠 LLM Service (port 8091): HTTP REST
- ⚙️ Backend <-> 🎭 Voice Cloning (port 8092): HTTP REST
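
A quick way to sanity-check the two browser-facing endpoints is the sketch below. It assumes the backend is reachable on localhost:8000 and uses the `requests` and `websockets` packages; the message format on /v1/realtime is unmute-specific, so this only opens the socket and reads the first message.

```python
# Minimal smoke test of the two browser-facing backend endpoints shown
# above. Assumes the backend runs locally on port 8000; the realtime
# protocol itself is unmute-specific, so we only read one message.
import asyncio
import requests
import websockets

def check_health() -> None:
    resp = requests.get("http://localhost:8000/v1/health", timeout=5)
    print(resp.status_code, resp.text)

async def open_realtime() -> None:
    async with websockets.connect("ws://localhost:8000/v1/realtime") as ws:
        first = await ws.recv()  # e.g. a session/handshake message
        print("first message:", first)

if __name__ == "__main__":
    check_health()
    asyncio.run(open_realtime())
```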

I deployed this service on an AWS EC2 instance. To make it externally accessible, I configured a custom domain and set up Nginx as a reverse proxy. The following diagram illustrates the deployment architecture.

- 👤 User -> 🌐 Browser -> https://IP_ADDRESS -> 🔀 Nginx (port 443, HTTPS)
- 🔀 Nginx routing:
  - / -> 🌐 Frontend (Next.js, localhost:3000)
  - /v1/* -> ⚙️ Backend API (FastAPI, localhost:8000)
  - /v1/realtime -> WebSocket upgrade, proxied to the backend
- ⚙️ Backend <-> 🎤 STT Service (localhost:8090): WebSocket
- ⚙️ Backend <-> 🔊 TTS Service (localhost:8089): WebSocket
- ⚙️ Backend <-> 🧠 LLM Service (localhost:8091): HTTP REST
- ⚙️ Backend <-> 🎭 Voice Cloning (localhost:8092): HTTP REST
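
The routing above can be expressed as an Nginx server block along the following lines. This is a minimal sketch rather than my exact config: example.com and the certificate paths are placeholders, and the key detail is the Upgrade/Connection headers on /v1/realtime so WebSocket connections survive the proxy.

```nginx
# Sketch of the routing above (placeholder domain and cert paths).
server {
    listen 443 ssl;
    server_name example.com;

    ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

    # Next.js frontend
    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
    }

    # FastAPI backend (longest-prefix match sends /v1/realtime below)
    location /v1/ {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
    }

    # Realtime WebSocket: needs Upgrade/Connection headers to pass through
    location /v1/realtime {
        proxy_pass http://localhost:8000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```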

Modules

Kyutai is an interesting AI lab that has released several notable models, such as Moshi. Unlike Moshi, which is an end-to-end spoken dialogue model, unmute is a cascaded system with three modules: a speech-to-text module, an LLM, and a text-to-speech module.

In this cascaded setup, the speech-to-text and text-to-speech components are tailored to the models released by Kyutai; the text-to-speech model in particular is tightly fused into the backend engine, which is written in Rust. The LLM is served with vLLM. To reduce latency, the text-to-speech module consumes LLM tokens in a streaming fashion. This is a common approach, but it can sacrifice sentence-level controllability, such as intonation and emotion.
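
As a minimal sketch of that streaming handoff (assuming vLLM's standard OpenAI-compatible API on localhost:8091; the model name and the `send_to_tts` helper are placeholders, not unmute's actual code):

```python
# Stream LLM tokens and hand each chunk to TTS as soon as it arrives.
# Assumptions: vLLM exposes its OpenAI-compatible API on localhost:8091;
# `send_to_tts` stands in for whatever pushes text over the TTS WebSocket.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")

def send_to_tts(text: str) -> None:
    # Hypothetical stub: in unmute this would write to the TTS WebSocket.
    print(f"[to TTS] {text}", flush=True)

stream = client.chat.completions.create(
    model="my-chat-model",  # placeholder model name
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
)

# Forwarding chunks immediately, instead of waiting for a full sentence,
# lowers latency at the cost of sentence-level prosody control.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        send_to_tts(delta)
```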

Some key conversational elements, such as backchannels and filler words, are not explicitly handled; backchanneling is left implicitly to the LLM. Interruptions, however, are handled explicitly with a VAD function built into the speech-to-text model.
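
To illustrate that interruption path, here is a minimal sketch, assuming the speech-to-text service emits a VAD speech probability alongside transcripts; the event shape and the threshold are illustrative, not unmute's actual protocol.

```python
# Sketch of explicit interruption handling driven by STT-side VAD.
# Assumed event shape: {"type": "vad", "speech_prob": float} or
# {"type": "transcript", ...} -- illustrative only.
import asyncio

VAD_THRESHOLD = 0.5  # assumed threshold; tune for the actual VAD output

async def handle_stt_events(stt_events: asyncio.Queue, tts_task_holder: dict) -> None:
    """Cancel the current TTS playback task when the user barges in."""
    while True:
        event = await stt_events.get()
        if event["type"] == "vad" and event["speech_prob"] > VAD_THRESHOLD:
            tts_task = tts_task_holder.get("current")
            if tts_task is not None and not tts_task.done():
                # The user started speaking while the bot was talking:
                # stop speech generation immediately and yield the turn.
                tts_task.cancel()
                tts_task_holder["current"] = None
        elif event["type"] == "transcript":
            pass  # forward the text to the LLM for the next turn
```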

Overall impression

I’ve modified the unmute repo into my own version, talker (https://github.com/jianboma/talker/tree/main).

There are still several issues to address. For example, the character personas change inconsistently during conversations, and the bot occasionally responds with nonsensical answers. Additionally, running all three models on an A10 GPU does not provide sufficient speed for smooth performance.

Despite these challenges, the system’s overall schema, particularly its approach to handling interruptions and determining when to start and stop generating speech, is well designed and brings the project close to a production-ready state.

Further improvements could focus on fine-tuning a compact LLM specifically for conversational AI, ideally with integrated tool-use capabilities to enable more agentic interactions.