Architecture

Technical overview of VideoCaptioner's architecture, processing pipeline, and development setup.

Processing Pipeline

  1. Audio/Video Input
  2. Speech Recognition
  3. Subtitle Segmentation
  4. LLM Optimization
  5. Translation
  6. Video Synthesis
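The six stages above form a linear pipeline, with each stage consuming the previous stage's output. A minimal sketch of that data flow, using stub stage functions (the names and dictionary keys are illustrative, not the project's actual API):

```python
from functools import reduce
from typing import Any, Callable

Stage = Callable[[Any], Any]

def run_pipeline(data: Any, stages: list[Stage]) -> Any:
    """Feed the output of each stage into the next, left to right."""
    return reduce(lambda acc, stage: stage(acc), stages, data)

# Stub stages mirroring the six steps; real stages would call
# Whisper, an LLM, and FFmpeg respectively.
stages: list[Stage] = [
    lambda path: {"audio": path},                         # 1. input
    lambda d: {**d, "words": ["hello", "world"]},         # 2. speech recognition
    lambda d: {**d, "segments": [" ".join(d["words"])]},  # 3. segmentation
    lambda d: {**d, "optimized": d["segments"]},          # 4. LLM optimization
    lambda d: {**d, "translated": d["optimized"]},        # 5. translation
    lambda d: {**d, "output": "out.mp4"},                 # 6. video synthesis
]

result = run_pipeline("input.mp4", stages)
print(result["output"])  # → out.mp4
```

Keeping stages as plain callables makes it easy to skip or swap steps, e.g. omitting translation for same-language subtitles.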

Technical Highlights

  • Word-level timestamps — Precise timing from Whisper with VAD refinement
  • Semantic segmentation — LLM-driven sentence breaks based on meaning, not just timing
  • Context-sensitive translation — Full subtitle context passed to LLM for coherent translation
  • Concurrent batch processing — Multiple LLM requests in parallel for speed
  • FFmpeg integration — Professional video synthesis with multiple output formats
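The concurrent batch processing highlight can be sketched with the standard-library `ThreadPoolExecutor`: subtitle lines are grouped into batches and sent as parallel requests, with order preserved. `fake_llm_translate` is a stand-in for a real LLM call, and the batch/worker sizes are illustrative, not the project's defaults:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_llm_translate(batch: list[str]) -> list[str]:
    """Stand-in for one LLM request translating a batch of subtitle lines."""
    return [f"<zh>{line}" for line in batch]

def translate_concurrently(
    lines: list[str], batch_size: int = 2, max_workers: int = 4
) -> list[str]:
    """Split lines into batches and translate the batches in parallel.

    pool.map yields results in submission order, so subtitle order
    is preserved even though requests complete out of order.
    """
    batches = [lines[i : i + batch_size] for i in range(0, len(lines), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fake_llm_translate, batches)
    return [line for batch in results for line in batch]

print(translate_concurrently(["a", "b", "c"]))  # → ['<zh>a', '<zh>b', '<zh>c']
```

Batching trades latency against token overhead: larger batches give the LLM more surrounding context per request but make each failure costlier to retry.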

Technology Stack

  Component         Technology
  Language          Python 3.10+
  GUI Framework     PyQt5
  ASR Engines       FasterWhisper, WhisperCpp, Whisper API, Bijian, Jianying
  LLM Interface     OpenAI-compatible API (any provider)
  Translation       LLM, Google Translate, Bing Translate, DeepLX
  Video Processing  FFmpeg
  Download          yt-dlp + aria2
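For the video-synthesis step, FFmpeg's `subtitles` filter can burn a subtitle file into the video track while copying the audio stream unchanged. A sketch of building such a command from Python; the helper name and exact flags are illustrative, not necessarily the project's actual invocation:

```python
def build_burn_in_cmd(video: str, subs: str, output: str) -> list[str]:
    """Build an FFmpeg command that burns a subtitle file into the video.

    Uses the `subtitles` video filter (requires FFmpeg built with libass)
    and copies the audio stream as-is to avoid re-encoding it.
    """
    return [
        "ffmpeg", "-y",          # -y: overwrite output without asking
        "-i", video,
        "-vf", f"subtitles={subs}",
        "-c:a", "copy",          # keep the original audio untouched
        output,
    ]

cmd = build_burn_in_cmd("input.mp4", "subs.srt", "output.mp4")
# Execute with: subprocess.run(cmd, check=True)  -- requires FFmpeg on PATH
```

Returning the command as a list (rather than a shell string) avoids quoting issues when subtitle paths contain spaces.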

Development Setup

git clone https://github.com/WEIFENG2333/VideoCaptioner.git
cd VideoCaptioner

# Install with uv (recommended)
uv sync

# Run GUI
uv run videocaptioner

# Run CLI
uv run videocaptioner --help

# Type checking
uv run pyright

# Run tests
uv run pytest tests/test_cli/ -q

Project Stats

  Metric        Value
  GitHub Stars  13.8k+
  Forks         1.1k+
  Contributors  9
  License       GPL-3.0

Contributing

Contributions are welcome! To get started:

  1. Fork the repository on GitHub
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Make your changes and write tests
  4. Run uv run pyright and uv run pytest to verify
  5. Submit a pull request

For bug reports and feature requests, use GitHub Issues.