Multimodal fusion is still an open stitching problem
Unified models sound simple; in practice, tokenizers, sampling rates, and training objectives diverge, creating seams that polished demos paper over.
Alignment across modalities
Video frames, audio frames, and text tokens live on different clocks. Models must decide what “simultaneous” means when captions lag audio or when scene cuts break object continuity. Failures often look like confident language about the wrong visual evidence, a cousin of hallucination in text-only systems.
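The "different clocks" problem can be made concrete with a small sketch. This illustrative snippet (all names, frame rates, and tolerances are assumptions, not any particular model's pipeline) snaps caption-token timestamps to the nearest video frame and flags pairs where the gap is too large for "simultaneous" to be well defined:

```python
# Hypothetical sketch: snap text-token timestamps to the nearest video
# frame and flag pairs whose gap exceeds a tolerance. Frame rate,
# tolerance, and timestamps below are illustrative assumptions.
from bisect import bisect_left

def align_tokens_to_frames(token_times, frame_times, tol=0.05):
    """Return (frame_index, gap_seconds) per token.

    A gap larger than `tol` marks a pair where "simultaneous" is
    ambiguous and needs an explicit alignment policy.
    """
    out = []
    for t in token_times:
        i = bisect_left(frame_times, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(frame_times)]
        j = min(candidates, key=lambda k: abs(frame_times[k] - t))
        out.append((j, abs(frame_times[j] - t)))
    return out

frames = [k / 30.0 for k in range(90)]   # 3 s of 30 fps video
tokens = [0.10, 1.02, 2.97, 3.40]        # caption word onsets (s)
for idx, gap in align_tokens_to_frames(tokens, frames):
    print(idx, round(gap, 3))
```

Note the last token falls after the final frame, so its gap exceeds the tolerance; that is exactly the case where a model should not pretend it saw matching visual evidence.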
Attention budgets
Cross-attention between streams adds compute, and long media sequences strain memory (see self-attention and context windows). For streaming video assistants, that cost surfaces directly as latency UX.
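A back-of-envelope estimate shows why long media sequences stress memory. This sketch (assumed shapes and fp16 storage; the tokens-per-frame and head counts are illustrative, not a specific model's) sizes just the cross-attention score matrix between a media stream and a text prompt:

```python
# Back-of-envelope sketch: bytes needed to materialize the
# cross-attention score matrix (queries x keys x heads), assuming
# fp16 (2 bytes). All shapes below are illustrative assumptions.
def attn_score_bytes(q_len, kv_len, heads, bytes_per=2):
    return q_len * kv_len * heads * bytes_per

# 5 minutes of 30 fps video at 256 tokens per frame, attended
# against a 4096-token text prompt, with 16 heads:
media_tokens = 5 * 60 * 30 * 256
gb = attn_score_bytes(media_tokens, 4096, heads=16) / 1e9
print(round(gb, 1), "GB of scores per layer")
```

Even before any optimization like chunking or flash-style kernels, the naive score matrix alone runs to hundreds of gigabytes per layer, which is why streaming systems aggressively subsample or pool media tokens.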
Relation to diffusion media
Image and video generation pipelines face analogous trade-offs between local detail and global coherence (see the diffusion essay), even when the architecture is not transformer-identical.
Evaluation beyond the glossy clip
Stress-test lighting shifts, accents, dialects, and noisy microphones—dimensions often absent from benchmarks.
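One lightweight way to make that coverage explicit is to enumerate a condition grid up front. The dimension names below are illustrative placeholders, not a standard benchmark taxonomy:

```python
# Illustrative sketch: build a stress-test grid over conditions that
# rarely appear in benchmarks. Dimension values are made up for
# this example; real suites would draw them from collected data.
from itertools import product

lighting = ["daylight", "low_light", "backlit"]
audio = ["clean_mic", "accented_speech", "noisy_mic"]

cases = [{"lighting": l, "audio": a} for l, a in product(lighting, audio)]
print(len(cases), "condition pairs to evaluate")
```

Running every checkpoint against the full grid, rather than a hand-picked demo clip, is what turns "it looked fine" into a regression signal.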
Audio-visual synchronization metrics
Objective scores for lip-sync, word error rate on forced alignment, and frame-level event matching help catch regressions when teams change sampling or frame rates, going beyond subjective demo viewing.
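Frame-level event matching can be reduced to a simple offset estimate. This minimal sketch (it assumes per-frame activity envelopes have already been extracted from each stream; the function and convention are illustrative) cross-correlates the two envelopes to find the audio-video lag in frames:

```python
# Minimal sketch of an objective sync check: estimate the audio-video
# offset by cross-correlating per-frame activity envelopes.
# Assumes envelopes are already extracted and sampled at the same rate.
def estimate_offset(audio_env, video_env, max_lag=10):
    """Return the lag (in frames) maximizing envelope correlation.

    Positive lag means the audio event trails the video event by
    that many frames; zero means the streams are in sync.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(
            a * v
            for a, v in zip(audio_env[max(lag, 0):],
                            video_env[max(-lag, 0):])
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Tracking this number across releases turns "the lips look off" into a concrete regression: if a resampling change shifts the estimated offset, the test fails before a human ever watches the clip.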
Accessibility and modality fallbacks
When one modality is missing (muted video, image-only input), models should degrade gracefully: state uncertainty explicitly and offer alternate prompts rather than make confident guesses. This ties back to hallucination patterns.
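A fallback policy like that can be expressed as a small decision function. Everything here is a hypothetical sketch: the input container, policy names, and thresholds are invented for illustration, not any shipping system's API:

```python
# Hypothetical fallback policy: prefer explicit uncertainty over a
# confident guess when a modality is absent. All names are
# illustrative assumptions, not a real API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Inputs:
    video: Optional[bytes] = None
    audio: Optional[bytes] = None
    text: Optional[str] = None

def answer_policy(inp: Inputs) -> str:
    missing = [m for m in ("video", "audio", "text")
               if getattr(inp, m) is None]
    if not missing:
        return "answer"  # all modalities present: full-confidence path
    if inp.text is not None:
        # Partial input: answer, but surface what was missing.
        return "answer_with_caveat:" + ",".join(missing)
    # No text anchor at all: ask rather than guess.
    return "ask_for_clarification"
```

The point of routing through an explicit policy, rather than always generating, is that the "confident guess from missing evidence" failure mode becomes a testable branch instead of an emergent behavior.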