Speech Recognition (ASR)

Configure the speech recognition engine that converts audio to text. Choose between local and cloud-based options.

Supported Engines

Engine	Type	Languages	Best For
FasterWhisper (recommended)	Local	99 languages	Best accuracy, GPU acceleration
WhisperCpp	Local	99 languages	Lightweight, CPU-friendly
Whisper API	Cloud (OpenAI)	99 languages	No local models needed
Bijian	Cloud (no key)	Chinese, English	Quick testing, no setup
Jianying	Cloud (no key)	Chinese, English	Alternative to Bijian

FasterWhisper Setup

FasterWhisper is the recommended engine for production use. It offers the best accuracy of the supported engines, with optional GPU acceleration.

Terminal

# Use FasterWhisper with the large-v2 model
videocaptioner transcribe video.mp4 --asr faster-whisper --model large-v2

# With a specific language
videocaptioner transcribe video.mp4 --asr faster-whisper --language en
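
To apply the same command to a whole folder, a small shell function can loop over the files. The following is a minimal sketch, not part of the tool: it assumes `videocaptioner` is on your PATH, and the function name `transcribe_dir` and the `DRY_RUN` variable are hypothetical helpers introduced here for illustration.

```shell
# Transcribe every .mp4 in a directory with FasterWhisper.
# Usage: transcribe_dir DIR [MODEL]   (MODEL defaults to large-v2)
# DRY_RUN=1 (hypothetical switch) prints each command instead of running it.
transcribe_dir() {
    dir=$1
    model=${2:-large-v2}
    for video in "$dir"/*.mp4; do
        # Skip the unexpanded pattern when the directory has no .mp4 files.
        [ -e "$video" ] || continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf '%s\n' "videocaptioner transcribe $video --asr faster-whisper --model $model"
        else
            videocaptioner transcribe "$video" --asr faster-whisper --model "$model"
        fi
    done
}
```

Running `DRY_RUN=1 transcribe_dir ./videos` first lets you review the commands before starting a long batch.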

Model Selection

Model	Size	Speed	Recommended For
tiny	~75 MB	Fastest	Quick testing only
small	~460 MB	Fast	English content
medium	~1.5 GB	Moderate	Chinese content
large-v2	~3 GB	Slower	Best accuracy (recommended)
large-v3	~3 GB	Slower	Newer, but v2 often more stable
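
If download size or speed matters more than peak accuracy, the table can be turned into a simple lookup. This is a sketch under assumptions: `pick_model` is a hypothetical helper, and `zh` is assumed to be the language code for Chinese (only `en` appears in the examples on this page).

```shell
# Map a language code to a model name, following the Model Selection table.
# Usage: pick_model LANG
pick_model() {
    case "$1" in
        en) echo small ;;      # table: small is recommended for English content
        zh) echo medium ;;     # table: medium is recommended for Chinese content (zh is an assumed code)
        *)  echo large-v2 ;;   # table: large-v2 is the overall recommendation
    esac
}

# Example call (left commented out so that sourcing this file runs nothing):
# videocaptioner transcribe video.mp4 --asr faster-whisper --model "$(pick_model en)"
```

When accuracy is the priority, skip the lookup and pass `--model large-v2` directly.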

VAD (Voice Activity Detection)

VAD filters out silent segments before transcription. This reduces hallucinated text in stretches with no speech and improves accuracy.

Terminal

videocaptioner transcribe video.mp4 --asr faster-whisper --vad silero-v4

Silero V4 is the recommended VAD model. Always enable VAD for best results.

Audio Separation

For videos with background music or noise, enable audio separation to isolate the speech track before transcription:

Terminal

videocaptioner transcribe video.mp4 --enable-vocal-separation

When to enable audio separation

Enable audio separation when the video has prominent background music, multiple speakers talking over each other, or environmental noise. It adds processing time but improves transcription accuracy on noisy audio.
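
Because separation adds processing time, it can help to enable it only for sources you know are noisy. The sketch below assumes that `--vad` and `--enable-vocal-separation` can be combined in one command, which this page does not show explicitly. The function name `transcribe_video`, the `noisy` keyword, and the `DRY_RUN` variable are hypothetical.

```shell
# Transcribe with VAD always on; add vocal separation only for noisy sources.
# Usage: transcribe_video VIDEO [noisy]
# DRY_RUN=1 (hypothetical switch) prints the command instead of running it.
transcribe_video() {
    video=$1
    profile=${2:-clean}
    # Rebuild the positional parameters as the command line to execute.
    set -- videocaptioner transcribe "$video" --asr faster-whisper --vad silero-v4
    if [ "$profile" = "noisy" ]; then
        # Assumption: this flag can be combined with --asr and --vad.
        set -- "$@" --enable-vocal-separation
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

For example, `transcribe_video concert.mp4 noisy` adds the separation step, while `transcribe_video lecture.mp4` skips it.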

Cloud ASR (No Setup)

For quick testing without downloading models, use a cloud engine. Bijian needs no API key; the OpenAI Whisper API requires one.

Terminal

# Bijian ASR (Chinese/English only, no API key)
videocaptioner transcribe video.mp4 --asr bijian

# OpenAI Whisper API (requires API key)
videocaptioner transcribe video.mp4 --asr whisper-api
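
The choice between engines can be written down as a small lookup based on the Supported Engines table. This is an illustrative sketch only: `pick_engine` is a hypothetical helper, and `zh` is assumed to be the language code for Chinese.

```shell
# Choose an --asr value from the Supported Engines table.
# Usage: pick_engine LANG [local|cloud]   (defaults to local)
pick_engine() {
    lang=$1
    mode=${2:-local}
    if [ "$mode" = "cloud" ]; then
        case "$lang" in
            zh|en) echo bijian ;;       # keyless; Chinese and English only
            *)     echo whisper-api ;;  # other languages; requires an API key
        esac
    else
        echo faster-whisper             # recommended local engine
    fi
}

# Example call (left commented out so that sourcing this file runs nothing):
# videocaptioner transcribe video.mp4 --asr "$(pick_engine en cloud)"
```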