Transcript service with Whisper

Modules

WhisperLive provides a server/client pattern for near-live transcription, which fits well when the audio source is a browser microphone and the user expects partial updates. WhisperLive also supports multiple inference backends (e.g., CPU/GPU-optimized implementations), which makes it practical to adapt to different hardware environments without changing the overall service contract.
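This interaction can be sketched in a few lines. The loop below is not WhisperLive's actual wire protocol — `decode` and `on_partial` are illustrative stand-ins — but it shows the shape of the pattern: each incoming audio chunk extends a running hypothesis, and a partial transcript is pushed to the client immediately rather than only at end-of-stream.

```python
from typing import Callable, Iterable, List


def stream_transcribe(
    chunks: Iterable[str],
    decode: Callable[[str], str],
    on_partial: Callable[[str], None],
) -> str:
    """Toy server loop for the server/client pattern: every chunk
    updates the running hypothesis, and a partial result is emitted
    right away so the client can render live updates."""
    hypothesis: List[str] = []
    for chunk in chunks:
        hypothesis.append(decode(chunk))
        on_partial(" ".join(hypothesis))
    # The final return is the complete transcript for the stream.
    return " ".join(hypothesis)
```

Swapping the `decode` callable is also where backend flexibility shows up: the loop is identical whether the chunk is decoded on CPU or GPU, which is what keeps the service contract stable across hardware.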

Operationally, a few design knobs matter a lot for a stable service: whether you load one model per session or share a single model across clients, how you cap concurrent clients and enforce connection time limits, and whether you use voice activity detection (VAD) to avoid spending compute on silence.
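A minimal sketch of two of those knobs, using nothing beyond the standard library (`SessionGate` and `is_speech` are hypothetical helpers for illustration, not WhisperLive APIs): admission control caps concurrent clients and flags expired connections, and a crude energy threshold stands in for a real trained VAD such as Silero.

```python
import threading
import time
from typing import Sequence


class SessionGate:
    """Admission control for a transcription server: caps concurrent
    clients and enforces a per-connection time limit."""

    def __init__(self, max_clients: int = 4, max_connection_secs: float = 600.0):
        self._slots = threading.Semaphore(max_clients)
        self.max_connection_secs = max_connection_secs

    def try_admit(self) -> bool:
        # Non-blocking acquire: reject the client if all slots are taken.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()

    def expired(self, connected_at: float) -> bool:
        # connected_at is a time.monotonic() timestamp taken at admit time.
        return (time.monotonic() - connected_at) >= self.max_connection_secs


def is_speech(pcm: Sequence[float], threshold: float = 0.01) -> bool:
    """Energy-based stand-in for VAD: skip decoding when the chunk's
    mean energy falls below a threshold. Real deployments use a trained
    model, but the gating logic around it looks the same."""
    if not pcm:
        return False
    energy = sum(x * x for x in pcm) / len(pcm)
    return energy >= threshold
```

The per-session-vs-shared-model choice interacts with the client cap: a shared model keeps memory flat as clients grow but serializes inference, while per-session models scale compute cost linearly with `max_clients`.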

Faster-Whisper is used as the transcription engine. It reimplements Whisper inference on top of CTranslate2, an optimized inference engine for Transformer models, which yields faster decoding and lower memory usage than the reference implementation.
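A minimal sketch of calling Faster-Whisper directly, assuming the `faster-whisper` package is installed. The import is deferred so the formatting helper can be used on its own; `Segment` and `to_timestamped_lines` are illustrative additions here, not part of the library.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    start: float
    end: float
    text: str


def to_timestamped_lines(segments: Iterable[Segment]) -> List[str]:
    """Render transcription segments as '[start-end] text' lines."""
    return [f"[{s.start:06.2f}-{s.end:06.2f}] {s.text.strip()}" for s in segments]


def transcribe_file(path: str, model_size: str = "small") -> List[str]:
    # Requires the faster-whisper package; imported lazily so the
    # formatting helper above stays usable without it.
    from faster_whisper import WhisperModel

    # int8 keeps memory low on CPU; use device="cuda" for GPU inference.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # vad_filter=True skips silent stretches before decoding.
    segments, info = model.transcribe(path, beam_size=5, vad_filter=True)
    return to_timestamped_lines(Segment(s.start, s.end, s.text) for s in segments)
```

Note that `transcribe` returns a lazy generator of segments, so decoding only happens as the list comprehension consumes it.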

Several other projects build on Faster-Whisper. WhisperX, for example, adds speaker diarization and wav2vec2-based forced alignment for word-level timestamps, which makes it well suited to online meeting transcription.