
Intelligent Video Captioning

by Shiv Trivedi

Originally published on the Mosaic blog. Mirrored here for archival; the Mosaic version is the canonical copy.


Creating effective video captions is more complex than simply placing text at the bottom of the screen. Good captions must respect natural speech patterns, avoid obscuring important visual content, and enhance rather than distract from the viewing experience. After extensive research and experimentation, I developed an AI-assisted system that creates contextually aware, dynamically positioned captions that feel natural and avoid obstructing crucial visual information.

The solution is a three-phase workflow that combines computer vision, natural language processing, and AI-powered design to generate captions that fit the video content.

The Three-Phase Pipeline: Markup → Analyze → Emplace

Each phase tackles a distinct challenge in the captioning problem, from understanding speech flow to finding optimal visual placement.

Phase 1: Markup — Intelligent Boundary Detection

The first phase performs comprehensive video analysis to determine where caption segments should begin and end. This goes far beyond simple sentence boundaries — effective captions must respect the natural rhythm of speech while accommodating visual constraints.

Advanced Speech Recognition and Transcription

The system uses Deepgram's API to generate detailed transcriptions with:

Multi-Modal Boundary Analysis

A specialized BoundaryAnalyzer component employs multiple detection methods working in concert:

Audio Pattern Recognition:

Visual Scene Analysis:

Linguistic Features:

Intelligent Scoring and Segmentation

The system combines all of these features into a single weighted score:

gap_score = (
    0.2 * gap_duration +
    0.4 * audio_alignment_score +
    0.3 * punctuation_weight +
    0.1 * phrase_starter_weight
)
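As a rough illustration, the formula above can be wrapped in a small function. The four weights come straight from the formula; everything else is an assumption made for this sketch: the `GapFeatures` name, the idea that the three scores are already normalized to [0, 1], and the cap that keeps very long pauses from dominating.

```python
from dataclasses import dataclass

# Weights taken from the formula above; they sum to 1.0.
WEIGHTS = {
    "gap_duration": 0.2,
    "audio_alignment_score": 0.4,
    "punctuation_weight": 0.3,
    "phrase_starter_weight": 0.1,
}

@dataclass
class GapFeatures:
    """Hypothetical container for one candidate break between two words."""
    gap_duration: float           # pause length in seconds
    audio_alignment_score: float  # assumed to be in [0, 1]
    punctuation_weight: float     # assumed to be in [0, 1]
    phrase_starter_weight: float  # assumed to be in [0, 1]

def gap_score(f: GapFeatures, max_gap: float = 1.0) -> float:
    """Weighted sum of the four features, with the pause capped at max_gap."""
    # Assumption: clamp and scale the pause so it stays in [0, 1].
    gap = min(max(f.gap_duration, 0.0), max_gap) / max_gap
    return (
        WEIGHTS["gap_duration"] * gap
        + WEIGHTS["audio_alignment_score"] * f.audio_alignment_score
        + WEIGHTS["punctuation_weight"] * f.punctuation_weight
        + WEIGHTS["phrase_starter_weight"] * f.phrase_starter_weight
    )

strong = GapFeatures(0.6, 0.9, 1.0, 1.0)  # long pause after a sentence end
weak = GapFeatures(0.05, 0.1, 0.0, 0.0)   # short gap in the middle of a phrase
print(round(gap_score(strong), 3), round(gap_score(weak), 3))
```

Because the audio term carries the largest weight, a gap that lines up with a pause in the audio scores well even without punctuation.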

As shown in the boundary analysis visualization, the system successfully identifies natural breaks (green lines) while strategically placing forced breaks (orange lines) when segments exceed optimal length limits. Scene changes (red dashed lines) force caption boundaries to ensure text doesn't obscure important visual transitions.

Caption Boundary Analysis chart

The boundary analysis reveals how the system balances natural speech patterns with practical constraints, achieving 57.1% natural breaks while maintaining readability.
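The split between natural and forced breaks could be produced by a greedy pass like the one sketched below. This is not the post's actual segmentation logic: the `segment_breaks` name, the 0.5 score threshold, and the 2-second length limit are hypothetical, and scene-change boundaries are left out for brevity.

```python
def segment_breaks(words, scores, threshold=0.5, max_len=2.0):
    """Return [(i, kind)], where each break falls after word i.

    words  : list of (start, end) times in seconds
    scores : scores[i] is the gap score between word i and word i + 1
    """
    breaks = []
    seg_start = words[0][0]
    for i, score in enumerate(scores):
        if score >= threshold:
            kind = "natural"   # high-scoring gap: break here
        elif words[i + 1][1] - seg_start > max_len:
            kind = "forced"    # the next word would overrun the length limit
        else:
            continue
        breaks.append((i, kind))
        seg_start = words[i + 1][0]  # the next segment starts with the next word
    return breaks

words = [(0.0, 0.4), (0.5, 0.9), (1.0, 1.4), (2.0, 2.5), (2.6, 3.0),
         (3.1, 3.6), (3.7, 4.2), (4.3, 4.8), (4.9, 5.4)]
scores = [0.1, 0.2, 0.8, 0.1, 0.2, 0.1, 0.3, 0.2]

breaks = segment_breaks(words, scores)
natural_share = sum(kind == "natural" for _, kind in breaks) / len(breaks)
print(breaks, natural_share)
```

A share of natural breaks, like the figure quoted above, falls out of the same bookkeeping: count the breaks tagged `"natural"` and divide by the total.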

The Research Journey: From Complex Models to Elegant Solutions

Before arriving at the final composite frame approach, I explored several sophisticated computer vision techniques, each revealing important insights about the caption placement problem.

Initial Approach: YOLO + Visual Saliency with Temporal Tracking

My first attempt combined semantic object detection with visual saliency mapping and added a sophisticated temporal tracking component. The process worked as follows:

  1. Identify Important Regions — YOLO detected semantically important objects (people, text, key items)
  2. Generate Saliency Maps — Visual saliency models predicted where viewers would naturally look
  3. Find Dead Regions — Combined these analyses to identify "dead space" — areas with low saliency and no important objects
  4. Temporal Tracking — Tracked these dead regions across consecutive frames throughout each caption segment
  5. Optimal Zone Selection — Found areas that remained consistently "dead" for the entire caption duration
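Steps 3 to 5 above can be sketched with NumPy, assuming the detector and saliency model have already run. No YOLO or saliency model is called here: the saliency maps and boxes are synthetic stand-ins, and the function names and the 0.3 threshold are hypothetical.

```python
import numpy as np

def dead_mask(saliency, boxes, sal_thresh=0.3):
    """Pixels with low saliency that fall outside every object box."""
    mask = saliency < sal_thresh          # fresh boolean array, safe to modify
    for x0, y0, x1, y1 in boxes:          # boxes as (x0, y0, x1, y1), end-exclusive
        mask[y0:y1, x0:x1] = False
    return mask

def stable_dead_region(saliency_maps, boxes_per_frame, sal_thresh=0.3):
    """Pixels that stay dead in every frame of the caption segment."""
    stable = np.ones_like(saliency_maps[0], dtype=bool)
    for sal, boxes in zip(saliency_maps, boxes_per_frame):
        stable &= dead_mask(sal, boxes, sal_thresh)
    return stable

# Synthetic segment: 3 frames of 6x8 pixels, salient top rows, one moving object.
H, W = 6, 8
saliency_maps, boxes_per_frame = [], []
for t in range(3):
    sal = np.full((H, W), 0.1)
    sal[0:2, :] = 0.9                      # top two rows attract attention
    saliency_maps.append(sal)
    boxes_per_frame.append([(t, 2, t + 2, 4)])  # object drifts right over time

stable = stable_dead_region(saliency_maps, boxes_per_frame)
print(stable.astype(int))
```

Even in this toy case the per-frame work adds up: every frame needs a detector pass, a saliency pass, and a mask intersection.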

YOLO + saliency dead-region analysis

The temporal tracking was particularly complex, involving frame-by-frame region correspondence analysis, stability scoring to ensure placement areas remained visually unimportant throughout the entire caption duration, and sophisticated filtering to handle edge cases like partial occlusions and gradual scene transitions.

The system performed extensive analysis to ensure optimal placement:

Despite this technical sophistication, the approach revealed critical limitations:

Advanced Saliency Models

I then experimented with more sophisticated approaches:

While these methods produced impressive technical results, they suffered from critical practical limitations:

The Breakthrough Realization

The key insight came from recognizing that caption placement isn't about understanding individual frames — it's about understanding what remains constant versus what changes throughout an entire caption segment.

Traditional computer vision approaches were solving the wrong problem. YOLO could identify objects, saliency models could predict attention, but none addressed the fundamental question: "Where can text safely exist for the entire duration of this caption without interfering with important visual information?"

This realization led to abandoning complex multi-model pipelines in favor of a surprisingly simple approach: temporal frame averaging. Instead of trying to understand what's important in each frame, the system lets time itself reveal the answer.

Why Simpler Worked Better:

This represents a fundamental shift from analytical complexity to temporal simplicity — letting the video's own motion patterns reveal the optimal placement strategy.

Phase 2: Analyze — Composite Frame Technique

Instead of a multi-model pipeline, this phase uses temporal frame averaging, which addresses the caption placement problem with far less machinery.

The Composite Frame Innovation

For each caption segment timespan, the system extracts all frames within that duration and creates a composite image by averaging them together. This creates a "long exposure" effect that reveals:

Temporal composite of a single-subject scene
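The averaging step itself is small. The sketch below assumes the frames for one caption segment are already decoded into same-sized `uint8` NumPy arrays (for example by a video reader such as OpenCV); the `composite_frame` name and the synthetic frames are illustrative only.

```python
import numpy as np

def composite_frame(frames):
    """Average same-sized uint8 frames into one 'long exposure' image."""
    stack = np.stack([f.astype(np.float32) for f in frames])  # avoid uint8 overflow
    return stack.mean(axis=0).round().astype(np.uint8)

# Synthetic segment: gray background with a bright block sliding along row 1.
frames = []
for t in range(4):
    frame = np.full((4, 8), 100, dtype=np.uint8)
    frame[1, t * 2 : t * 2 + 2] = 255
    frames.append(frame)

comp = composite_frame(frames)
print(comp)
```

Static pixels keep their original value, while pixels the block passed through land between background and foreground, which is the ghosting effect described above.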

Computational Efficiency

Unlike complex object detection systems requiring heavy neural networks, this approach needs only basic frame averaging — making it computationally efficient while achieving sophisticated results. The system processes specific frame ranges corresponding to each caption segment's timing, enabling:

Automatic Safe Zone Identification

The composite frame approach handles caption placement directly: static elements stay sharp in the composite, while areas with motion blur into semi-transparent ghosts. The composite effectively becomes a mask showing where text can be placed without obscuring important content. The solution falls out of the visual characteristics of temporal averaging rather than from complex algorithmic detection.
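The post does not describe a numeric selection step (in Phase 3 the composite is handed to a multimodal model), so the following is purely an illustrative assumption. It shows one way a motion map derived from the same frames could point at a calm region. The function names and the three-band split are hypothetical.

```python
import numpy as np

def motion_map(frames):
    """Mean absolute deviation of each pixel from its temporal average."""
    stack = np.stack([f.astype(np.float32) for f in frames])
    return np.abs(stack - stack.mean(axis=0)).mean(axis=0)

def calmest_band(frames, n_bands=3):
    """Split the frame into horizontal bands; return the index with least motion."""
    bands = np.array_split(motion_map(frames), n_bands, axis=0)
    return int(np.argmin([band.mean() for band in bands]))

# Synthetic segment: strong motion in the top band, slight motion in the middle.
frames = []
for t in range(4):
    frame = np.full((6, 8), 100, dtype=np.uint8)
    frame[0, t * 2 : t * 2 + 2] = 255   # bright block moving across the top
    frame[3, t] = 200                   # small flicker in the middle band
    frames.append(frame)

print(calmest_band(frames))  # 0 = top, 1 = middle, 2 = bottom
```

In this toy example only the bottom band is completely static, so it is the one selected.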

Motion composite example

A real example showing the composite frame technique in practice. The temporal averaging creates a natural ghosting effect where moving subjects (the two speakers) become semi-transparent while static elements (background, whiteboard) remain clear, automatically revealing optimal caption placement zones.

Phase 3: Emplace — AI-Powered Design and Animation

The final phase combines artificial intelligence with precise positioning to create professional-quality caption overlays.

AI-Generated SVG Creation

For each caption segment, the system uses Google's Gemini AI to generate custom SVG graphics based on the composite frame analysis. The multimodal AI receives the composite frame as visual context and:

The AI has access to specialized tools that enable iterative testing of different positions and styles before finalizing placement decisions.

Dynamic Word-Level Animation

Word-level timing data from Deepgram drives sophisticated SVG animations created by Gemini:

Example of dynamic word-level animation with synchronized timing and effects.
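For a sense of what timing-driven SVG can look like, here is a hand-written sketch rather than Gemini's output. It assumes each word is a dict with `word`, `start`, and `end` keys, mirroring Deepgram's word-level results, and uses SMIL `<set>` elements to brighten each word when it is spoken. The `caption_svg` name and all styling values are hypothetical.

```python
from xml.sax.saxutils import escape

def caption_svg(words, x=40, y=60, width=640, height=100):
    """Build an SVG caption whose words brighten as they are spoken."""
    t0 = words[0]["start"]  # the animation clock starts at the first word
    spans = []
    for w in words:
        begin = w["start"] - t0
        spans.append(
            f'<tspan fill-opacity="0.35">{escape(w["word"])} '
            f'<set attributeName="fill-opacity" to="1" '
            f'begin="{begin:.2f}s" fill="freeze"/>'
            f'</tspan>'
        )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
        f'<text x="{x}" y="{y}" font-size="28" fill="white">'
        f'{"".join(spans)}</text></svg>'
    )

words = [
    {"word": "Hello", "start": 1.0, "end": 1.4},
    {"word": "R&D", "start": 1.5, "end": 2.0},
]
svg = caption_svg(words)
print(svg)
```

Each word starts dimmed and switches to full opacity at its own offset, so the highlight follows the speech without any scripting in the SVG.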

Adaptive Positioning System

Unlike traditional bottom-centered captions, this system places text in the most visually appropriate location for each specific segment. Captions can appear at the top, bottom, sides, or even split across multiple regions based on what the composite frame analysis reveals.

Multi-Format Integration

The system converts SVG animations to PNG overlays when needed and integrates seamlessly with existing video editing workflows through composition operations.

Technical Architecture: LangGraph Orchestration

The entire workflow is built using LangGraph for orchestration, enabling:

Key Technical Innovations

1. Composite Frame Technique

The breakthrough innovation is using temporal averaging to understand spatial relationships:

2. Multi-Modal Boundary Detection

Combining audio analysis, visual scene detection, and linguistic processing creates more natural caption segmentation than any single approach.

3. AI-Assisted Design

Using multimodal AI for aesthetic decisions produces results that feel professionally designed rather than algorithmically generated.

Results and Performance

The three-phase approach achieved:

Technical Insights and Lessons Learned

Simple temporal analysis outperformed complex spatial models. After extensive experimentation with YOLO object detection, multiple saliency models, and temporal attention mechanisms, the breakthrough came from recognizing that caption placement is not only a spatial problem, but also a temporal one.

Computational efficiency enables practical deployment. While deep learning approaches (YOLO + saliency) produced impressive research results, the composite frame technique's minimal computational requirements make it viable for production use without specialized hardware.

Motion reveals more than attention models. Traditional saliency models attempt to predict where viewers will look, but for caption placement, understanding where motion occurs (and therefore where text would be distracting) proved more valuable than predicting attention patterns.

Natural speech boundaries outperform grammatical rules. Audio analysis consistently produced better segmentation than pure linguistic approaches because it captures the speaker's intended emphasis and pacing.

AI excels at aesthetic integration. While algorithmic approaches could identify safe zones, multimodal AI made superior decisions about typography, positioning, and visual harmony.

Parallel processing enables scalability. The async architecture allows the system to handle long-form videos efficiently by processing multiple segments simultaneously.
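A minimal sketch of per-segment concurrency with `asyncio` follows. It assumes the per-segment work is mostly I/O-bound (API calls, file reads). `process_segment` is a placeholder rather than the post's actual code, and the concurrency cap of 4 is arbitrary.

```python
import asyncio

async def process_segment(segment_id: int) -> str:
    """Stand-in for the per-segment work (composite frame, SVG generation)."""
    await asyncio.sleep(0.01)  # placeholder for I/O-bound work such as API calls
    return f"segment-{segment_id}: done"

async def process_all(segment_ids, max_concurrent=4):
    """Run segments concurrently, capped by a semaphore, keeping input order."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(sid):
        async with sem:  # at most max_concurrent segments in flight
            return await process_segment(sid)

    return await asyncio.gather(*(guarded(s) for s in segment_ids))

results = asyncio.run(process_all(range(6)))
print(results)
```

`asyncio.gather` returns results in the order the segments were submitted, so captions can be reassembled without extra sorting.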

Looking Forward

This approach opens possibilities for adaptive captioning that responds to content context, viewer preferences, and accessibility requirements. The modular pipeline makes it easy to experiment with different boundary detection methods or placement strategies while maintaining the overall workflow.

The key insight is that effective captioning requires understanding multiple modalities simultaneously — audio patterns, visual composition, linguistic structure, and aesthetic principles all contribute to the final result. By building systems that can reason across these different domains, we create captions that truly enhance rather than merely accompany video content.

The composite frame technique, in particular, represents a paradigm shift from complex object detection to elegant temporal analysis. This approach demonstrates that sometimes the most sophisticated solutions emerge from understanding the fundamental characteristics of the problem rather than applying increasingly complex algorithms.


