Speech Recognition (ASR)
Configure the speech recognition engine that converts audio to text. Choose between local and cloud-based options.
Supported Engines
| Engine | Type | Languages | Best For |
|---|---|---|---|
| FasterWhisper (recommended) | Local | 99 languages | Best accuracy, GPU acceleration |
| WhisperCpp | Local | 99 languages | Lightweight, CPU-friendly |
| Whisper API | Cloud (OpenAI) | 99 languages | No local models needed |
| Bijian | Cloud (free) | Chinese, English | Quick testing, no setup |
| Jianying | Cloud (free) | Chinese, English | Alternative free option |
FasterWhisper Setup
FasterWhisper is the recommended engine for production use. It offers the best accuracy of the supported engines and can optionally use GPU acceleration.
```shell
# Use FasterWhisper with the large-v2 model
videocaptioner transcribe video.mp4 --asr faster-whisper --model large-v2
# With a specific language
videocaptioner transcribe video.mp4 --asr faster-whisper --language en
```
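To apply the same settings to a whole folder, the command above can be wrapped in a loop. This is a minimal sketch rather than a built-in feature: `transcribe_dir` and the `DRY_RUN` variable are hypothetical names introduced here, and the sketch assumes the CLI takes one input file per invocation, as in the examples above.

```shell
#!/bin/sh
# transcribe_dir DIR
# Runs FasterWhisper (large-v2) on every .mp4 file directly inside DIR.
# DRY_RUN defaults to 1, which only prints the commands; set DRY_RUN=0 to run them.
transcribe_dir() {
    dir=$1
    for f in "$dir"/*.mp4; do
        # If the glob matched nothing, "$f" is the literal pattern; skip it.
        [ -e "$f" ] || continue
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "videocaptioner transcribe $f --asr faster-whisper --model large-v2"
        else
            # Report a failure but keep going with the remaining files.
            videocaptioner transcribe "$f" --asr faster-whisper --model large-v2 \
                || echo "failed: $f" >&2
        fi
    done
}

# Example: preview the commands for the current directory.
# transcribe_dir .
```

Printing the commands first makes it easy to check which files will be processed before committing to a long transcription run.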
Model Selection
| Model | Size | Speed | Recommended For |
|---|---|---|---|
| tiny | ~75 MB | Fastest | Quick testing only |
| small | ~460 MB | Fast | English content |
| medium | ~1.5 GB | Moderate | Chinese content |
| large-v2 | ~3 GB | Slower | Best accuracy (recommended) |
| large-v3 | ~3 GB | Slower | Newer, though large-v2 is often more stable |
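The recommendations in the table can be written down as a small lookup. The sketch below is illustrative only: `pick_model` and its `speed`/`accuracy` argument are hypothetical, and passing `--model` together with `--language` in one command is an assumption, since the examples above show each flag separately.

```shell
#!/bin/sh
# pick_model LANG [PRIORITY]
#   LANG:     language of the content, e.g. "en" or "zh"
#   PRIORITY: "accuracy" (default) or "speed"
# Prints a model name from the table above.
pick_model() {
    lang=$1
    priority=${2:-accuracy}
    if [ "$priority" = "accuracy" ]; then
        echo "large-v2"             # best accuracy, recommended
        return
    fi
    case "$lang" in
        en) echo "small" ;;         # table: recommended for English content
        zh) echo "medium" ;;        # table: recommended for Chinese content
        *)  echo "large-v2" ;;      # no faster recommendation in the table
    esac
}

model=$(pick_model en speed)
# Print the resulting command instead of running it.
echo "videocaptioner transcribe video.mp4 --asr faster-whisper --model $model --language en"
```

The `tiny` model is left out on purpose because the table marks it as suitable for quick testing only.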
VAD (Voice Activity Detection)
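VAD detects which parts of the audio contain speech and filters out silent segments before transcription. This reduces hallucinations (text generated for stretches where nobody is speaking) and improves accuracy.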
```shell
videocaptioner transcribe video.mp4 --asr faster-whisper --vad silero-v4
```
Silero V4 is the recommended VAD model. Always enable VAD for best results.
Audio Separation
For videos with background music or noise, enable audio separation to isolate the speech track before transcription:
```shell
videocaptioner transcribe video.mp4 --enable-vocal-separation
```
When to enable audio separation
Enable this when the video has prominent background music, multiple speakers talking over each other, or environmental noise. It adds processing time but can substantially improve transcription accuracy on noisy audio.
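Because separation costs extra processing time, it can make sense to add the flag only for inputs known to be noisy. The following sketch composes the command from the flags shown on this page. `build_transcribe_cmd` is a hypothetical helper, and combining `--enable-vocal-separation` with `--asr faster-whisper` and `--vad silero-v4` in a single command is an assumption, not something the examples above confirm.

```shell
#!/bin/sh
# build_transcribe_cmd FILE NOISY
#   FILE:  input video
#   NOISY: "yes" if the video has background music or noise
# Prints the command line; it does not run anything.
build_transcribe_cmd() {
    file=$1
    noisy=$2
    # Reuse the function's positional parameters as the argument list.
    set -- videocaptioner transcribe "$file" --asr faster-whisper --vad silero-v4
    if [ "$noisy" = "yes" ]; then
        set -- "$@" --enable-vocal-separation
    fi
    printf '%s\n' "$*"
}

build_transcribe_cmd lecture.mp4 no
build_transcribe_cmd concert.mp4 yes
```

VAD stays on in both cases, following the recommendation to always enable it.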
Cloud ASR (No Setup)
For quick testing without downloading models:
```shell
# Free Bijian ASR (Chinese and English only)
videocaptioner transcribe video.mp4 --asr bijian
# OpenAI Whisper API (requires an API key)
videocaptioner transcribe video.mp4 --asr whisper-api
```
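Since only the Whisper API needs a key, a script can choose between the two cloud engines depending on whether a key is available. This is a hedged sketch: `cloud_engine` is a hypothetical helper, and `OPENAI_API_KEY` is an assumed variable name. This page does not say how VideoCaptioner expects the key to be supplied, so check the configuration reference before relying on it.

```shell
#!/bin/sh
# cloud_engine
# Prints "whisper-api" when an API key is set (assumed to live in
# OPENAI_API_KEY), otherwise falls back to the free "bijian" engine.
cloud_engine() {
    if [ -n "${OPENAI_API_KEY:-}" ]; then
        echo "whisper-api"
    else
        echo "bijian"    # free, but Chinese and English only
    fi
}

# Print the resulting command instead of running it.
echo "videocaptioner transcribe video.mp4 --asr $(cloud_engine)"
```

The fallback is only appropriate for Chinese or English audio, since Bijian does not cover other languages.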