Architecture

Technical overview of VideoCaptioner's architecture, processing pipeline, and development setup.

Processing Pipeline

  1. Audio/Video Input
  2. Speech Recognition
  3. Subtitle Segmentation
  4. LLM Optimization
  5. Translation
  6. Video Synthesis
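The six stages above form a linear pipeline, with each stage consuming the previous stage's output. A minimal sketch of that data flow, using stub stage functions (the names and dictionary keys are illustrative, not the project's actual API):

```python
from functools import reduce
from typing import Any, Callable

Stage = Callable[[Any], Any]

def run_pipeline(data: Any, stages: list[Stage]) -> Any:
    """Feed the output of each stage into the next, left to right."""
    return reduce(lambda acc, stage: stage(acc), stages, data)

# Stub stages mirroring the six steps; real stages would call
# Whisper, an LLM, and FFmpeg respectively.
stages: list[Stage] = [
    lambda path: {"audio": path},                         # 1. input
    lambda d: {**d, "words": ["hello", "world"]},         # 2. speech recognition
    lambda d: {**d, "segments": [" ".join(d["words"])]},  # 3. segmentation
    lambda d: {**d, "optimized": d["segments"]},          # 4. LLM optimization
    lambda d: {**d, "translated": d["optimized"]},        # 5. translation
    lambda d: {**d, "output": "out.mp4"},                 # 6. video synthesis
]

result = run_pipeline("input.mp4", stages)
print(result["output"])  # → out.mp4
```

Keeping stages as plain callables makes it easy to skip or swap steps, e.g. omitting translation for same-language subtitles.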

Technical Highlights

  • Word-level timestamps — Precise timing from Whisper with VAD refinement
  • Semantic segmentation — LLM-driven sentence breaks based on meaning, not just timing
  • Context-sensitive translation — Full subtitle context passed to LLM for coherent translation
  • Concurrent batch processing — Multiple LLM requests in parallel for speed
  • FFmpeg integration — Professional video synthesis with multiple output formats
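The concurrent batch processing highlight can be sketched with the standard-library `ThreadPoolExecutor`: subtitle lines are grouped into batches and sent as parallel requests, with order preserved. `fake_llm_translate` is a stand-in for a real LLM call, and the batch/worker sizes are illustrative, not the project's defaults:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_llm_translate(batch: list[str]) -> list[str]:
    """Stand-in for one LLM request translating a batch of subtitle lines."""
    return [f"<zh>{line}" for line in batch]

def translate_concurrently(
    lines: list[str], batch_size: int = 2, max_workers: int = 4
) -> list[str]:
    """Split lines into batches and translate the batches in parallel.

    pool.map yields results in submission order, so subtitle order
    is preserved even though requests complete out of order.
    """
    batches = [lines[i : i + batch_size] for i in range(0, len(lines), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fake_llm_translate, batches)
    return [line for batch in results for line in batch]

print(translate_concurrently(["a", "b", "c"]))  # → ['<zh>a', '<zh>b', '<zh>c']
```

Batching trades latency against token overhead: larger batches give the LLM more surrounding context per request but make each failure costlier to retry.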

Technology Stack

  Component         Technology
  Language          Python 3.10+
  GUI Framework     PyQt5
  ASR Engines       FasterWhisper, WhisperCpp, Whisper API, Bijian, Jianying
  LLM Interface     OpenAI-compatible API (any provider)
  Translation       LLM, Google Translate, Bing Translate, DeepLX
  Video Processing  FFmpeg
  Download          yt-dlp + aria2
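For the video-synthesis step, FFmpeg's `subtitles` filter can burn a subtitle file into the video track while copying the audio stream unchanged. A sketch of building such a command from Python; the helper name and exact flags are illustrative, not necessarily the project's actual invocation:

```python
def build_burn_in_cmd(video: str, subs: str, output: str) -> list[str]:
    """Build an FFmpeg command that burns a subtitle file into the video.

    Uses the `subtitles` video filter (requires FFmpeg built with libass)
    and copies the audio stream as-is to avoid re-encoding it.
    """
    return [
        "ffmpeg", "-y",          # -y: overwrite output without asking
        "-i", video,
        "-vf", f"subtitles={subs}",
        "-c:a", "copy",          # keep the original audio untouched
        output,
    ]

cmd = build_burn_in_cmd("input.mp4", "subs.srt", "output.mp4")
# Execute with: subprocess.run(cmd, check=True)  -- requires FFmpeg on PATH
```

Returning the command as a list (rather than a shell string) avoids quoting issues when subtitle paths contain spaces.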

Development Setup

git clone https://github.com/WEIFENG2333/VideoCaptioner.git
cd VideoCaptioner

# Install with uv (recommended)
uv sync

# Run GUI
uv run videocaptioner

# Run CLI
uv run videocaptioner --help

# Type checking
uv run pyright

# Run tests
uv run pytest tests/test_cli/ -q

Project Stats

  Metric        Value
  GitHub Stars  13.8k+
  Forks         1.1k+
  Contributors  9
  License       GPL-3.0

Contributing

Contributions are welcome! To get started:

  1. Fork the repository on GitHub
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Make your changes and write tests
  4. Run uv run pyright and uv run pytest to verify
  5. Submit a pull request

For bug reports and feature requests, use GitHub Issues.