Synthetic data: amplifier or echo chamber?
Model-generated examples can scale supervision, or they can bake in subtle errors that compound until human or external verification breaks the loop.
Why teams use it
Human labels are expensive; synthetic pairs can cover long-tail intents, style variations, or tool-use traces. When quality controls hold, it unlocks progress where scaling alone stalls on data.
Failure mode: self-reinforcing mistakes
A model’s systematic blind spots become training targets for the next generation unless they are filtered out by external oracles, tool checks, or spot human review. The same echo effect undermines privacy-preserving pipelines when synthetic data inadvertently reproduces memorized sensitive prompts.
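To make the filtering idea concrete, here is a minimal sketch of routing model-generated examples through an external oracle before they can enter the next training set. The generator output, the arithmetic task format, and function names like `oracle_check` are illustrative assumptions, not a real pipeline.

```python
# Hypothetical filter: re-derive each answer with an external oracle instead
# of trusting the generating model, so systematic errors never become targets.

def oracle_check(question: str, answer: str) -> bool:
    """External verifier: recompute the left-hand side independently."""
    lhs, _, _ = question.partition("=")
    try:
        return int(eval(lhs, {"__builtins__": {}})) == int(answer)
    except Exception:
        return False

def filter_synthetic(candidates):
    """Keep only pairs the oracle confirms."""
    return [(q, a) for q, a in candidates if oracle_check(q, a)]

candidates = [
    ("17 + 25 =", "42"),   # correct: survives filtering
    ("13 * 7 =", "84"),    # systematic error: would self-reinforce if kept
]
print(filter_synthetic(candidates))  # only the verified pair remains
```

The key design point is that the oracle is independent of the generator; a filter learned from the same model's outputs would inherit the same blind spots.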
Relation to benchmarks
Synthetic evaluation sets risk overfitting too; see the benchmarks essay.
Legal and consent layers
Synthetic data does not automatically sidestep copyright concerns if its seeds derive from protected expression without permission.
Verification stacks
Pair synthetic generations with verifiers: smaller classifiers, symbolic checkers, execution against unit tests for code, or human spot checks on the long tail. Without verification loops, synthetic data is expensive fiction.
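The execution-based verifier above can be sketched as follows: a synthetic code solution is kept only if it passes held-out unit tests. The candidate source and the tests are illustrative assumptions; a real system would run this in a sandbox.

```python
# Minimal sketch of execution-based verification for synthetic code data.

def passes_tests(candidate_src: str, tests) -> bool:
    """Execute the candidate in a scratch namespace and run each test."""
    ns = {}
    try:
        exec(candidate_src, ns)  # NOTE: sandbox this in any real pipeline
        return all(test(ns) for test in tests)
    except Exception:
        return False

candidate_src = "def add(a, b):\n    return a + b\n"
tests = [
    lambda ns: ns["add"](2, 3) == 5,
    lambda ns: ns["add"](-1, 1) == 0,
]
print(passes_tests(candidate_src, tests))  # True: keep this pair
```

Candidates that raise exceptions or fail any test are simply dropped, which is what distinguishes this from trusting the generator's own confidence.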
Curriculum and difficulty progression
Start synthetic tasks from easier distributions and raise difficulty as the model improves, mirroring educational curricula and reducing the risk of collapse, while monitoring for evaluation leakage.
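One way to operationalize this progression is a simple difficulty scheduler that promotes when the measured pass rate clears a threshold and backs off when it drops. The thresholds, step sizes, and level range here are illustrative assumptions, not values from any particular system.

```python
# Hedged sketch: adjust synthetic-task difficulty from observed pass rates.

def next_difficulty(level: int, pass_rate: float,
                    promote_at: float = 0.8, demote_at: float = 0.4,
                    max_level: int = 10) -> int:
    if pass_rate >= promote_at:
        return min(level + 1, max_level)  # model is ready: harder tasks
    if pass_rate <= demote_at:
        return max(level - 1, 0)          # struggling: ease off, avoid collapse
    return level                          # hold steady otherwise

level = 0
for rate in [0.9, 0.85, 0.3, 0.7]:
    level = next_difficulty(level, rate)
print(level)  # 0 -> 1 -> 2 -> 1 -> 1
```

Measuring pass rate on a held-out slice, rather than on the synthetic training tasks themselves, is what keeps this loop from leaking evaluation signal.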