Diffusion: denoising as a controlled hallucination
Iterative denoising turns noise into structure; prompts steer trajectories—often sensitively—while objectives only approximate real-world physics.
The trajectory view
Diffusion models learn to reverse a forward noising process. Each step nudges pixels or latents toward configurations consistent with training data and conditioning signals (text embeddings, control nets, etc.). Small prompt edits can reroute the entire path—not because the model “misunderstood English,” but because conditioning biases early steps that constrain everything downstream.
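The sensitivity of trajectories to conditioning can be shown with a toy sketch. Everything here is hypothetical: `toy_score` stands in for a learned noise predictor, and the classifier-free-guidance blend is the standard formula applied to that stand-in, not a real model. Two conditioning values that differ by 10% pull the same starting noise toward visibly different endpoints because early steps compound.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_score(x, cond):
    # Stand-in for a learned noise predictor: pulls x toward a
    # conditioning-dependent target (hypothetical, not a real model).
    return x - cond * np.ones_like(x)

def reverse_step(x, cond, guidance=3.0, step=0.1):
    # Classifier-free guidance: blend unconditional and conditional
    # noise estimates; larger `guidance` amplifies the prompt's pull.
    eps_uncond = toy_score(x, 0.0)
    eps_cond = toy_score(x, cond)
    eps = eps_uncond + guidance * (eps_cond - eps_uncond)
    return x - step * eps

# Same starting noise, slightly different "prompts" (cond values):
# the trajectories diverge rather than ending up nearby.
x_a = x_b = rng.normal(size=4)
for _ in range(50):
    x_a = reverse_step(x_a, cond=1.0)
    x_b = reverse_step(x_b, cond=1.1)
print(np.abs(x_a - x_b).mean())
```

In this caricature each step contracts toward a conditioning-dependent fixed point, so the gap between the two runs grows toward a constant offset rather than washing out, which is the compounding effect the paragraph describes.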
Video: time becomes another failure surface
Temporal coherence competes with per-frame detail; artifacts (hands, object drift) often signal objective mismatch, not mere randomness. Multimodal systems that stitch vision and language share analogous seams—see multimodal fusion—even when backbones differ.
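The competition between coherence and detail can be made concrete with two crude proxy metrics on a toy clip; both functions are illustrative assumptions, not metrics from any particular paper. Smoothing frames over time reduces flicker, but it also blurs per-frame detail, which is exactly the objective tension described above.

```python
import numpy as np

def temporal_flicker(frames):
    # Mean absolute change between consecutive frames: a crude proxy
    # for temporal incoherence (flicker, object drift).
    return np.abs(np.diff(frames, axis=0)).mean()

def spatial_detail(frames):
    # Mean absolute spatial gradient per frame: a crude proxy for
    # per-frame sharpness.
    return np.abs(np.diff(frames, axis=1)).mean()

rng = np.random.default_rng(1)
# Toy "video": 8 frames of 16x16 noise, and the same clip after
# temporal smoothing (averaging adjacent frames).
noisy = rng.normal(size=(8, 16, 16))
smoothed = (noisy[:-1] + noisy[1:]) / 2

print(temporal_flicker(noisy), temporal_flicker(smoothed))
print(spatial_detail(noisy), spatial_detail(smoothed))
```

The smoothed clip scores lower on both metrics: flicker drops, but so does spatial detail. A sampler tuned only for one metric will predictably sacrifice the other.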
Conceptual link to language
Language models also roll out token trajectories; “hallucination” in text parallels confident image completions where constraints are under-specified. Grounding looks different—pixels rarely cite sources—but workflows that retrieve style references or layout constraints echo the same grounding principles.

Energy and access
High-resolution generation is compute-heavy; efficiency changes who can iterate creatively. Our note on energy and UX applies as much to GPUs rendering frames as to tokens streamed to chat clients.
Inpainting and regional control
Masked inpainting changes the conditioning graph—users edit regions while preserving context elsewhere. Failures often appear at mask boundaries; feathering, multi-step refinement, and latent-space consistency losses help, paralleling careful chunk boundaries in RAG.
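The re-imposition of known context at each step, and the effect of feathering, can be sketched in a few lines. `feathered_mask` and `inpaint_step` are hypothetical helpers (real pipelines typically operate in latent space and use Gaussian feathering); the point is only the blend arithmetic at the mask boundary.

```python
import numpy as np

def feathered_mask(size, hole, feather=1):
    # Binary hole mask softened at its edges by box blurring
    # (hypothetical helper; a crude stand-in for Gaussian feathering).
    m = np.zeros(size)
    r0, r1, c0, c1 = hole
    m[r0:r1, c0:c1] = 1.0
    for _ in range(feather):
        m = (m + np.roll(m, 1, 0) + np.roll(m, -1, 0)
               + np.roll(m, 1, 1) + np.roll(m, -1, 1)) / 5
    return m

def inpaint_step(generated, original, mask):
    # Each denoising step re-imposes the known context outside the
    # mask, keeping edits local; soft mask edges reduce boundary seams.
    return mask * generated + (1 - mask) * original

original = np.ones((8, 8))    # toy "known" context
generated = np.zeros((8, 8))  # toy freshly denoised content
mask = feathered_mask((8, 8), hole=(2, 6, 2, 6))
result = inpaint_step(generated, original, mask)
```

Far outside the hole the result is pure context, deep inside it is pure generation, and at the feathered edge it is a weighted mix—the region where visible seams appear when the blend is a hard 0/1 step instead.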
Safety filters in media pipelines
Pre- and post-filters for NSFW or policy-violating content interact with diffusion sampling: false positives frustrate artists; false negatives create liability. Policy, UX, and model behavior must be co-designed—see red-teaming for adversarial prompts in image space.