whisper-vtt2srt

What is WebVTT?

WebVTT (Web Video Text Tracks) is the W3C format for timed text, such as subtitles and captions, synchronized with HTML5 <video> elements. It builds on the older SRT format and adds cue settings for text alignment and positioning, plus CSS styling support.

While powerful, VTT files generated by automated tools often contain extra metadata and "Karaoke-style" partial updates that standard players and Text-to-Speech engines handle poorly. For more technical details, see the MDN WebVTT API documentation.
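One concrete difference between the two formats is the timestamp syntax: WebVTT writes "mm:ss.ttt" or "hh:mm:ss.ttt" and may append cue settings to the timing line, while SRT expects "hh:mm:ss,ttt" with nothing after the end time. The sketch below illustrates that conversion only. The function names are hypothetical and are not the whisper-vtt2srt API.

```python
import re

# WebVTT timestamps: optional hours, then mm:ss.ttt (dot before milliseconds).
_VTT_TS = re.compile(r"^(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})$")


def vtt_to_srt_timestamp(ts: str) -> str:
    """Convert one WebVTT timestamp to SRT form (hours required, comma separator)."""
    match = _VTT_TS.match(ts.strip())
    if match is None:
        raise ValueError(f"not a WebVTT timestamp: {ts!r}")
    hours, minutes, seconds, millis = match.groups()
    return f"{int(hours or 0):02d}:{minutes}:{seconds},{millis}"


def vtt_timing_to_srt(line: str) -> str:
    """Convert a cue timing line, dropping cue settings such as 'align:start'."""
    start, arrow, rest = line.partition("-->")
    if not arrow or not rest.split():
        raise ValueError(f"not a cue timing line: {line!r}")
    end = rest.split()[0]  # anything after the end time is a cue setting
    return f"{vtt_to_srt_timestamp(start)} --> {vtt_to_srt_timestamp(end)}"
```

Cue settings are discarded rather than translated because SRT has no standard equivalent for them.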

Clean Whisper Subtitles & yt-dlp Workflow

If you use OpenAI Whisper for transcription or yt-dlp to download YouTube auto-captions, you've likely encountered the "stuttering" effect, where each cue repeats earlier text and the captions accumulate line by line. Removing those repeats, which we call Karaoke Deduplication, is exactly what this tool does.
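To make the idea concrete, here is a minimal sketch of one way to deduplicate rolling captions: keep only the lines of each cue that did not already appear in the previous cue, and drop cues that add nothing new. It assumes the cues have already been parsed into a simple structure. The Cue type and function name are hypothetical, and the actual algorithm in whisper-vtt2srt may differ.

```python
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: str
    end: str
    lines: List[str]


def dedupe_rolling_cues(cues: List[Cue]) -> List[Cue]:
    """Keep only the lines each cue adds relative to the cue before it."""
    cleaned: List[Cue] = []
    previous: List[str] = []
    for cue in cues:
        current = [line.strip() for line in cue.lines if line.strip()]
        # Lines already shown by the previous cue are the "stutter".
        fresh = [line for line in current if line not in previous]
        if fresh:
            cleaned.append(Cue(cue.start, cue.end, fresh))
        previous = current
    return cleaned


rolling = [
    Cue("00:00:00,000", "00:00:02,000", ["hello world"]),
    Cue("00:00:02,000", "00:00:02,010", ["hello world"]),
    Cue("00:00:02,010", "00:00:04,000", ["hello world", "this is a test"]),
    Cue("00:00:04,000", "00:00:06,000", ["this is a test", "of captions"]),
]
deduped = dedupe_rolling_cues(rolling)
```

With this input, the second cue is dropped entirely and the remaining three cues each carry one new line. A line that is legitimately spoken twice in consecutive cues would also be removed, which is a trade-off of comparing text alone.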

The Modern Subtitle Pipeline:

  • Ingestion: Use yt-dlp to fetch high-quality VTT streams.
  • Transcription: Process audio through Whisper to get accurately timestamped text.
  • Normalization: Use whisper-vtt2srt to convert the messy VTT into a clean, human-readable SRT.
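The ingestion step above could be scripted along these lines. This is a sketch that only builds and runs a yt-dlp command line for auto-generated captions; it assumes yt-dlp is installed and on PATH, and the helper names and output template are illustrative choices, not part of this tool.

```python
import subprocess
from typing import List


def build_caption_command(url: str, lang: str = "en") -> List[str]:
    """Build a yt-dlp command that saves auto-captions as VTT without the video."""
    return [
        "yt-dlp",
        "--skip-download",        # captions only, no media file
        "--write-auto-subs",      # request auto-generated captions
        "--sub-langs", lang,
        "--sub-format", "vtt",
        "-o", "%(id)s.%(ext)s",   # assumed naming scheme for the output
        url,
    ]


def fetch_auto_captions(url: str, lang: str = "en") -> None:
    """Run the command; raises CalledProcessError if yt-dlp fails."""
    subprocess.run(build_caption_command(url, lang), check=True)
```

Passing the arguments as a list, rather than a shell string, avoids quoting problems with URLs that contain characters such as "&".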

Why Clean VTT for TTS?

Text-to-Speech (TTS) engines like ElevenLabs, OpenAI Audio, or Coqui TTS require clean, punctuation-normalized text. Tags like <c>, <b>, and sound descriptions such as [Music] or (Laughter) can confuse the AI voice, leading to unnatural pauses or literal reading of bracketed text.

Our converter acts as a subtitle preprocessor, stripping away the noise and ensuring the text is perfectly formatted for AI voice generation, video dubbing, or high-quality captioning.
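As an illustration of this kind of preprocessing, the sketch below strips inline tags and bracketed sound descriptions with regular expressions, then collapses the leftover whitespace. The function name is hypothetical and the rules are deliberately simple; the converter's own cleaning logic may be different.

```python
import re

_TAG = re.compile(r"<[^>]+>")                  # <c>, </c>, <b>, <00:00:01.500>
_SOUND = re.compile(r"\[[^\]]*\]|\([^)]*\)")   # [Music], (Laughter)
_SPACES = re.compile(r"\s+")


def clean_for_tts(text: str) -> str:
    """Remove markup and sound descriptions, leaving plain speakable text."""
    text = _TAG.sub("", text)      # drop tags but keep the words inside them
    text = _SOUND.sub(" ", text)   # drop bracketed descriptions entirely
    return _SPACES.sub(" ", text).strip()
```

Note that this also removes ordinary parenthetical remarks, so a stricter pattern, or a list of known sound labels, may be preferable when the transcript contains spoken asides in parentheses.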

Ready to automate?

Install the Python package to batch-process large collections of files.

pip install whisper-vtt2srt
packages = ["whisper-vtt2srt==0.1.2"]