Generative media

Diffusion: denoising as a controlled hallucination

Iterative denoising turns noise into structure; prompts steer trajectories—often sensitively—while objectives only approximate real-world physics.

The trajectory view

Diffusion models learn to reverse a forward noising process. Each step nudges pixels or latents toward configurations consistent with training data and conditioning signals (text embeddings, ControlNet maps, etc.). Small prompt edits can reroute the entire path—not because the model “misunderstood English,” but because conditioning biases early steps that constrain everything downstream.
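A toy sketch of this trajectory view: a one-dimensional "reverse diffusion" where each step pulls a noisy sample toward a conditioning-dependent target. This is not a real diffusion model—the pull term stands in for a learned score network—but it shows how two nearby conditioning targets yield diverging endpoints.

```python
import numpy as np

def toy_reverse_diffusion(cond_target, steps=50, seed=0):
    """Toy 1-D 'denoising' loop (illustrative, not a real model):
    each step applies a conditioning-dependent pull plus shrinking
    stochasticity, mimicking how conditioning biases the trajectory."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0)                 # start from pure noise
    for t in range(steps, 0, -1):
        sigma = t / steps                    # remaining noise level
        x += 0.2 * (cond_target - x)         # stand-in for a conditioned score
        x += rng.normal(0.0, 0.1 * sigma)    # stochasticity decays over steps
    return x

# Different "prompts" (targets) steer the same noise to different endpoints.
a = toy_reverse_diffusion(cond_target=+2.0)
b = toy_reverse_diffusion(cond_target=-2.0)
```

Because the pull compounds multiplicatively across steps, the bias injected early determines which basin the sample settles into—the 1-D analogue of prompt edits rerouting an image.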

Video: time becomes another failure surface

Temporal coherence competes with per-frame detail; artifacts (hands, object drift) often signal objective mismatch, not mere randomness. Multimodal systems that stitch vision and language share analogous seams—see multimodal fusion—even when backbones differ.
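The competition between per-frame detail and temporal coherence can be made concrete as a combined objective. The sketch below (weights and names are illustrative, not from any particular system) scores a clip by rewarding high-frequency spatial detail while penalizing frame-to-frame change—the two terms pull in opposite directions, which is the objective mismatch described above.

```python
import numpy as np

def temporal_consistency_penalty(frames, detail_weight=1.0, smooth_weight=1.0):
    """Illustrative combined objective for a clip of shape (T, H, W):
    per-frame detail (mean squared spatial gradient) is rewarded,
    temporal drift (mean squared change between consecutive frames)
    is penalized. Lower is 'better' under these toy weights."""
    frames = np.asarray(frames, dtype=float)
    gy = np.diff(frames, axis=1)             # vertical spatial gradients
    gx = np.diff(frames, axis=2)             # horizontal spatial gradients
    detail = (gy ** 2).mean() + (gx ** 2).mean()
    drift = (np.diff(frames, axis=0) ** 2).mean()
    return smooth_weight * drift - detail_weight * detail
```

A perfectly static, flat clip scores zero on both terms; a flickering clip is punished even if each frame is individually sharp—tuning the weights trades shimmer against blur.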

Conceptual link to language

Language models also roll out token trajectories; “hallucination” in text parallels confident image completions where constraints are under-specified. Grounding looks different—pixels rarely cite sources—but workflows that combine retrieval of style references or layout constraints echo grounding principles.

Energy and access

High-resolution generation is compute-heavy; efficiency changes who can iterate creatively. Our note on energy and UX applies as much to GPUs rendering frames as to tokens streamed to chat clients.

Inpainting and regional control

Masked inpainting changes the conditioning graph—users edit regions while preserving context elsewhere. Failures often appear at mask boundaries; feathering, multi-step refinement, and latent-space consistency losses help—parallel to careful chunk boundaries in RAG.
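Two of the mitigations above can be sketched in a few lines. This is a generic illustration of a common inpainting pattern (not any specific library's API): at each denoising step, keep the model's prediction inside the mask and re-impose a noised copy of the original outside it, and feather the mask so the boundary blends rather than seams.

```python
import numpy as np

def feather(mask, radius=2):
    """Soften a binary mask with repeated 5-point box blurs, so the
    boundary transitions smoothly instead of cutting hard (feathering)."""
    out = mask.astype(float)
    for _ in range(radius):
        p = np.pad(out, 1, mode="edge")
        out = (p[:-2, 1:-1] + p[2:, 1:-1] +
               p[1:-1, :-2] + p[1:-1, 2:] + p[1:-1, 1:-1]) / 5.0
    return out

def masked_blend_step(denoised, original, mask, sigma, rng):
    """One inpainting step (illustrative): the model's output fills the
    masked region (mask=1), while a freshly noised copy of the original
    pins the context region (mask=0) at the current noise level."""
    noised_original = original + rng.normal(0.0, sigma, original.shape)
    return mask * denoised + (1.0 - mask) * noised_original
```

With a feathered mask the blend is a weighted average near the boundary, which is exactly where hard-mask artifacts tend to appear.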

Safety filters in media pipelines

Pre- and post-filters for NSFW or policy-violating content interact with diffusion sampling; false positives frustrate artists, false negatives create liability. Policy, UX, and model behavior must be co-designed—see red-teaming for adversarial prompts in image space.
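The wiring of such a pipeline is simple even though the policy questions are not. A minimal sketch, where all four callables are hypothetical stand-ins: a pre-filter screens prompts before compute is spent, and a post-filter screens outputs before delivery.

```python
def guarded_generate(prompt, generate, pre_filter, post_filter):
    """Illustrative pre/post filter wiring (all callables hypothetical).
    Filters return True to block. Blocking before generation saves
    compute; blocking after catches what prompt screening missed.
    Thresholds for both must be tuned jointly with UX and policy."""
    if pre_filter(prompt):
        return {"status": "blocked_prompt", "image": None}
    image = generate(prompt)
    if post_filter(image):
        return {"status": "blocked_output", "image": None}
    return {"status": "ok", "image": image}
```

The two filters fail differently—a false positive here is a blocked artist, a false negative is shipped liability—so their thresholds should be set and red-teamed together, not independently.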