# Architecture

Technical overview of VideoCaptioner's architecture, processing pipeline, and development setup.
## Processing Pipeline

1. Audio/Video Input
2. Speech Recognition
3. Subtitle Segmentation
4. LLM Optimization
5. Translation
6. Video Synthesis
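The stages above can be sketched as a simple function chain. This is a minimal illustration, not the project's actual module layout: `Cue`, `recognize`, and `segment` are hypothetical names, and the ASR stage is stubbed.

```python
from dataclasses import dataclass

# Hypothetical data model: each cue carries text plus start/end times in seconds.
@dataclass
class Cue:
    start: float
    end: float
    text: str

def recognize(audio_path: str) -> list[Cue]:
    """Stage 2: ASR yields word-level cues with timestamps (stubbed here)."""
    return [Cue(0.0, 1.2, "hello"), Cue(1.2, 2.5, "world")]

def segment(cues: list[Cue]) -> list[Cue]:
    """Stage 3: merge word-level cues into sentence-level subtitles."""
    if not cues:
        return []
    merged = Cue(cues[0].start, cues[-1].end, " ".join(c.text for c in cues))
    return [merged]

def pipeline(audio_path: str) -> list[Cue]:
    cues = recognize(audio_path)  # 2. Speech Recognition
    cues = segment(cues)          # 3. Subtitle Segmentation
    # Stages 4-6 (LLM optimization, translation, video synthesis) would follow.
    return cues

print(pipeline("demo.wav"))
```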
## Technical Highlights
- Word-level timestamps — Precise timing from Whisper with VAD refinement
- Semantic segmentation — LLM-driven sentence breaks based on meaning, not just timing
- Context-sensitive translation — Full subtitle context passed to LLM for coherent translation
- Concurrent batch processing — Multiple LLM requests in parallel for speed
- FFmpeg integration — Professional video synthesis with multiple output formats
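The concurrent batch processing can be sketched with a thread pool. This is a hedged sketch, not the project's implementation: `optimize_batch` is a stand-in for the real LLM request, and the batch/worker sizes are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def optimize_batch(batch: list[str]) -> list[str]:
    # Stand-in for an LLM call that cleans up a batch of subtitle lines.
    return [line.strip().capitalize() for line in batch]

def optimize_all(lines: list[str], batch_size: int = 2, workers: int = 4) -> list[str]:
    # Split the subtitles into batches and issue the requests in parallel.
    batches = [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(optimize_batch, batches)  # map preserves batch order
    return [line for batch in results for line in batch]

print(optimize_all(["hello there", " general kenobi", "you are", "a bold one"]))
```

Because `Executor.map` returns results in submission order, the optimized lines come back in the original subtitle order even though the requests run concurrently.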
## Technology Stack
| Component | Technology |
|---|---|
| Language | Python 3.10+ |
| GUI Framework | PyQt5 |
| ASR Engines | FasterWhisper, WhisperCpp, Whisper API, Bijian, Jianying |
| LLM Interface | OpenAI-compatible API (any provider) |
| Translation | LLM, Google Translate, Bing Translate, DeepLX |
| Video Processing | FFmpeg |
| Download | yt-dlp + aria2 |
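Video synthesis with FFmpeg reduces to a single command for the subtitle burn-in case. Below is a minimal wrapper using the standard FFmpeg `subtitles` filter; the flags are generic FFmpeg options, not necessarily the project's exact invocation.

```python
import subprocess

def burn_subtitles(video: str, srt: str, output: str, run: bool = False) -> list[str]:
    # Render the SRT onto the video track while copying the audio stream unchanged.
    cmd = [
        "ffmpeg", "-y",
        "-i", video,
        "-vf", f"subtitles={srt}",  # libass-based hard-sub filter
        "-c:a", "copy",
        output,
    ]
    if run:
        subprocess.run(cmd, check=True)  # requires ffmpeg on PATH
    return cmd

print(" ".join(burn_subtitles("input.mp4", "subs.srt", "output.mp4")))
```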
## Development Setup

```bash
git clone https://github.com/WEIFENG2333/VideoCaptioner.git
cd VideoCaptioner

# Install with uv (recommended)
uv sync

# Run GUI
uv run videocaptioner

# Run CLI
uv run videocaptioner --help

# Type checking
uv run pyright

# Run tests
uv run pytest tests/test_cli/ -q
```
## Project Stats
| Metric | Value |
|---|---|
| GitHub Stars | 13.8k+ |
| Forks | 1.1k+ |
| Contributors | 9 |
| License | GPL-3.0 |
## Contributing

Contributions are welcome! To get started:

1. Fork the repository on GitHub
2. Create a feature branch: `git checkout -b feature/my-feature`
3. Make your changes and write tests
4. Run `uv run pyright` and `uv run pytest` to verify
5. Submit a pull request
For bug reports and feature requests, use GitHub Issues.