Speech Recognition (ASR)

Configure the speech recognition engine that converts audio to text. Choose between local and cloud-based options.

Supported Engines

| Engine | Type | Languages | Best For |
| --- | --- | --- | --- |
| FasterWhisper (recommended) | Local | 99 languages | Best accuracy, GPU acceleration |
| WhisperCpp | Local | 99 languages | Lightweight, CPU-friendly |
| Whisper API | Cloud (OpenAI) | 99 languages | No local models needed |
| Bijian | Cloud (free) | Chinese, English | Quick testing, no setup |
| Jianying | Cloud (free) | Chinese, English | Alternative free option |

FasterWhisper Setup

FasterWhisper is the recommended engine for production use. It offers the best accuracy of the supported engines, with optional GPU acceleration.

Terminal
# Use FasterWhisper with the large-v2 model
videocaptioner transcribe video.mp4 --asr faster-whisper --model large-v2

# With specific language
videocaptioner transcribe video.mp4 --asr faster-whisper --language en
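
To apply the same FasterWhisper command to a whole folder, a small shell function along these lines may help. It is only a sketch: the function name `transcribe_all` and the `VC_BIN` override are hypothetical helpers introduced here, not part of the CLI, and the only flags used are the ones shown above.

```shell
# Transcribe every .mp4 in a directory with FasterWhisper and large-v2.
# VC_BIN is a hypothetical override (not a CLI feature) that allows dry runs.
transcribe_all() {
  dir=$1
  for f in "$dir"/*.mp4; do
    # Skip the unexpanded pattern when the directory has no .mp4 files.
    [ -e "$f" ] || continue
    "${VC_BIN:-videocaptioner}" transcribe "$f" --asr faster-whisper --model large-v2
  done
}
```

Calling `transcribe_all ./videos` runs one transcription per file. Setting `VC_BIN=echo` first prints each command instead of running it, which is a cheap way to check the file list before a long job.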

Model Selection

| Model | Size | Speed | Recommended For |
| --- | --- | --- | --- |
| tiny | ~75 MB | Fastest | Quick testing only |
| small | ~460 MB | Fast | English content |
| medium | ~1.5 GB | Moderate | Chinese content |
| large-v2 | ~3 GB | Slower | Best accuracy (recommended) |
| large-v3 | ~3 GB | Slower | Newer, but v2 often more stable |
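
The recommendations in the table can be captured in a small lookup so that scripts do not hard-code a model name. This is a sketch only: the function `pick_model` and the labels `test`, `english`, and `chinese` are hypothetical names chosen for illustration, and anything else falls back to the recommended large-v2.

```shell
# Map a content label to a model name, following the table above.
# The labels are hypothetical; the model names come from the table.
pick_model() {
  case "$1" in
    test)    echo "tiny" ;;      # quick testing only
    english) echo "small" ;;     # English content
    chinese) echo "medium" ;;    # Chinese content
    *)       echo "large-v2" ;;  # best accuracy (recommended default)
  esac
}
```

The result can then be passed to the `--model` flag, for example `videocaptioner transcribe video.mp4 --asr faster-whisper --model "$(pick_model chinese)"`.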

VAD (Voice Activity Detection)

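VAD detects which parts of the audio contain speech and filters out the silent segments before transcription. This reduces hallucinations (text generated for audio that contains no speech) and improves accuracy.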

Terminal
videocaptioner transcribe video.mp4 --asr faster-whisper --vad silero-v4

Silero V4 is the recommended VAD model. Always enable VAD for best results.
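
Since VAD is recommended for every run, one option is to wrap the command so the flag cannot be forgotten. The sketch below assumes that `--vad` can be combined with other flags such as `--language`, which the examples above do not show explicitly. The function name `transcribe_vad` and the `VC_BIN` override are hypothetical.

```shell
# Run FasterWhisper with Silero V4 VAD always enabled.
# Usage: transcribe_vad <video> [extra flags...]
# VC_BIN is a hypothetical override (not a CLI feature) for dry runs.
transcribe_vad() {
  video=$1
  shift
  # Any remaining arguments are forwarded unchanged after the fixed flags.
  "${VC_BIN:-videocaptioner}" transcribe "$video" --asr faster-whisper --vad silero-v4 "$@"
}
```

For example, `transcribe_vad talk.mp4 --language en` expands to the VAD command shown above with the language flag appended.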

Audio Separation

For videos with background music or noise, enable audio separation to isolate the speech track before transcription:

Terminal
videocaptioner transcribe video.mp4 --enable-vocal-separation

When to enable audio separation

Use this when the video has significant background music, multiple speakers talking over each other, or environmental noise. Separation adds processing time, but it can greatly improve transcription accuracy on noisy audio.
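
Because separation costs extra processing time, it can make sense to enable it per file rather than globally. The sketch below takes a second argument that marks the audio as noisy. The function name, the `noisy` label, and the `VC_BIN` override are hypothetical, and running `transcribe` with no other flags is assumed to fall back to the tool's defaults.

```shell
# Enable vocal separation only for files marked as noisy.
# Usage: transcribe_track <video> [noisy]
# VC_BIN is a hypothetical override (not a CLI feature) for dry runs.
transcribe_track() {
  video=$1
  if [ "${2:-clean}" = "noisy" ]; then
    # Background music or noise: isolate the speech track first.
    "${VC_BIN:-videocaptioner}" transcribe "$video" --enable-vocal-separation
  else
    # Clean audio: skip separation to save processing time.
    "${VC_BIN:-videocaptioner}" transcribe "$video"
  fi
}
```

A concert recording would be run as `transcribe_track concert.mp4 noisy`, while a studio interview can omit the second argument.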

Cloud ASR (No Setup)

For quick testing without downloading models:

Terminal
# Free Bijian ASR (Chinese/English only)
videocaptioner transcribe video.mp4 --asr bijian

# OpenAI Whisper API (requires API key)
videocaptioner transcribe video.mp4 --asr whisper-api
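
Since Bijian only covers Chinese and English, a script that handles mixed-language material might choose the cloud engine from the language code. This is a sketch under assumptions: the function name `pick_cloud_engine` is hypothetical, and `zh` as the code for Chinese is assumed by analogy with the `en` example earlier on this page.

```shell
# Choose a cloud engine from a language code.
# "zh" for Chinese is an assumption; "en" appears in the examples above.
pick_cloud_engine() {
  case "$1" in
    en|zh) echo "bijian" ;;       # free, Chinese/English only
    *)     echo "whisper-api" ;;  # other languages, requires an API key
  esac
}
```

The result can be passed to `--asr`, for example `videocaptioner transcribe video.mp4 --asr "$(pick_cloud_engine en)"`.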