Overall connection diagram
I deployed this service on an AWS EC2 instance. To make it externally accessible, I configured a custom domain and set up Nginx as a reverse proxy. The following diagram illustrates the deployment architecture.
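For reference, a reverse-proxy setup like this typically looks something like the following. This is a minimal sketch, not the actual config from my deployment: the domain, certificate paths, and backend port are placeholders, and the WebSocket upgrade headers are included because a real-time voice service streams audio over persistent connections.

```nginx
server {
    listen 443 ssl;
    server_name example.com;  # placeholder; use your custom domain

    # placeholder certificate paths (e.g. from Let's Encrypt)
    ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8000;  # backend port is an assumption

        # WebSocket upgrade headers, needed for streaming audio
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```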
Kyutai is an AI lab that has released several interesting models, such as Moshi.
In this cascaded example, the speech-to-text and text-to-speech models are the ones released by Kyutai; in particular, the text-to-speech model is tightly fused into the backend engine, which is written in Rust. The LLM is served with vLLM. To reduce latency, the text-to-speech model consumes tokens in a streaming fashion, a common approach that may, however, sacrifice sentence-level controllability such as intonation and emotion.
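The streaming idea can be sketched as follows: buffer LLM tokens and flush them to TTS at phrase boundaries, so speech starts before the full reply exists. This is an illustrative sketch, not the unmute/talker API; `synthesize` and the token stream are stand-ins.

```python
# Minimal sketch of token streaming into TTS. Flushing at punctuation
# lowers latency but means the TTS never sees the whole sentence, which
# is exactly the controllability trade-off described above.

def synthesize(text):
    """Stand-in for a streaming TTS call; returns the text it would speak."""
    return f"<audio:{text.strip()}>"

def stream_to_tts(token_stream, boundaries=".,!?"):
    """Buffer tokens, flushing to TTS whenever a boundary character arrives."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if buffer and buffer[-1] in boundaries:
            yield synthesize(buffer)
            buffer = ""
    if buffer.strip():  # flush any trailing partial phrase
        yield synthesize(buffer)

tokens = ["Hello", ",", " how", " are", " you", "?"]
chunks = list(stream_to_tts(tokens))
# Each phrase is synthesized as soon as it completes, not after the full reply
```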
Some key conversational elements, such as backchannels and filler words, are not explicitly handled; backchanneling is left implicitly to the LLM. Interruptions, however, are handled explicitly via a VAD function built into the speech-to-text model.
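The explicit interruption path can be sketched as a small state machine: a VAD signal from the speech-to-text side cancels ongoing TTS playback. The class and method names below are illustrative assumptions, not taken from the unmute/talker codebase.

```python
# Sketch of VAD-driven interruption handling. When the user starts
# speaking while the bot is mid-utterance, generation is cancelled
# immediately rather than waiting for the sentence to finish.

class Agent:
    def __init__(self):
        self.speaking = False
        self.events = []  # event log for illustration

    def start_speaking(self):
        self.speaking = True
        self.events.append("tts_start")

    def on_vad(self, user_is_speaking):
        # VAD callback from the STT side: an interruption is
        # "user speech detected while the bot is speaking".
        if user_is_speaking and self.speaking:
            self.speaking = False
            self.events.append("tts_cancelled")

agent = Agent()
agent.start_speaking()
agent.on_vad(user_is_speaking=True)  # user barges in; TTS is cancelled
```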
I've modified the unmute repo into my own version in my repo, talker (https://github.com/jianboma/talker/tree/main).
There are still several issues to address. For example, the character personas change inconsistently during conversations, and the bot occasionally responds with nonsensical answers. Additionally, running all three models on an A10 GPU does not provide sufficient speed for smooth performance.
Despite these challenges, the system's overall design, particularly its approach to handling interruptions and deciding when to start and stop generating speech, is well thought out and brings the project close to a production-ready state.
Further improvements could focus on fine-tuning a compact LLM specifically for conversational AI, ideally with integrated tool-use capabilities to enable more agentic interactions.