WhisperX: time-accurate speech transcription of long-form audio

Bain M, Huh JS, Han T, Zisserman A

18 August 2023

Conference paper

International Speech Communication Association

pp.

4489 - 4493

Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding-window approaches is prone to drifting, hallucination and repetition, and prohibits batched transcription due to their sequential nature. Further, timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out-of-the-box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelvefold transcription speedup via batched inference. The code is available open-source.

DOI

10.21437/Interspeech.2023-78

