Recursive Synthetic Data Training
Concept page for training models repeatedly on data generated by earlier models.
Recursive Synthetic Data Training is a training process in which data generated by one model generation becomes part of the training set for a later generation. The loop can be intentional, as in self-training or synthetic-data bootstrapping, or incidental, as generated content leaks into future training corpora.[1]
Role in this wiki
This page explains the process behind model collapse. It is distinct from synthetic data in general: a one-time synthetic augmentation may be useful, while repeated reuse can amplify distributional errors. The wiki uses this page to separate the mechanism from the outcome. Recursive training is the loop; collapse is one possible degenerative result.
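The amplifying loop can be demonstrated with a toy model: fit a Gaussian, sample from the fit, refit on those samples, and repeat. This is an illustrative sketch (the function names and parameters are assumptions for this page, not from any cited paper); the fitted spread tends to shrink across generations because each fit is trained only on its predecessor's output and estimation error compounds.

```python
import random
import statistics

def fit_gaussian(samples):
    # Maximum-likelihood fit: sample mean and population standard deviation.
    return statistics.fmean(samples), statistics.pstdev(samples)

def recursive_train(n_generations=30, n_samples=100, seed=0):
    """Train generation t+1 only on data sampled from generation t."""
    rng = random.Random(seed)
    # Generation 0 is fit on "real" data from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    mu, sigma = fit_gaussian(data)
    sigmas = [sigma]
    for _ in range(n_generations):
        # Each later generation sees only synthetic samples from its predecessor.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu, sigma = fit_gaussian(data)
        sigmas.append(sigma)
    return sigmas

sigmas = recursive_train()
print(f"gen 0 sigma={sigmas[0]:.3f}, gen {len(sigmas) - 1} sigma={sigmas[-1]:.3f}")
```

A one-time synthetic augmentation corresponds to running this loop once; the degenerative behavior only appears when the loop is iterated.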
Connection to Qiao's work
The paper "When Sample Selection Bias Precipitates Model Collapse" studies recursive training under local sample-selection bias. Its setting is especially relevant to AI and networks because the data process is distributed: parties may see different data, select different samples, and share only limited signals. Recursive synthetic-data training therefore becomes a cross-silo reliability problem, not only a generative-model problem.
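To illustrate how local selection can steer the loop (a toy sketch under assumed conditions, not the paper's actual model: the Gaussian setting, party count, and "keep the largest samples" rule are all assumptions), suppose several parties each filter their draws before pooling. The pooled, selected data then trains the next generation, so the selection bias compounds:

```python
import random
import statistics

def biased_generation(mu, sigma, parties=4, per_party=100, keep_frac=0.5, rng=None):
    """One recursive step with a local selection rule at each party.

    Each party draws from the current model but keeps only its largest
    keep_frac fraction of samples; the pooled survivors fit the next model.
    """
    rng = rng or random.Random()
    pooled = []
    for _ in range(parties):
        local = sorted(rng.gauss(mu, sigma) for _ in range(per_party))
        keep = int(per_party * keep_frac)
        pooled.extend(local[-keep:])  # locally biased selection
    return statistics.fmean(pooled), statistics.pstdev(pooled)

def run(generations=10, seed=0):
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    trajectory = [(mu, sigma)]
    for _ in range(generations):
        mu, sigma = biased_generation(mu, sigma, rng=rng)
        trajectory.append((mu, sigma))
    return trajectory

traj = run()
print(f"mean drift: {traj[0][0]:.2f} -> {traj[-1][0]:.2f}, "
      f"spread: {traj[0][1]:.2f} -> {traj[-1][1]:.2f}")
```

In this sketch the fitted mean drifts toward the selected tail while the spread collapses, even though no single party's filter looks drastic in isolation.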
See also
Footnotes
1. The 2024 Nature paper "AI models collapse when trained on recursively generated data" helped popularize this recursive framing for generative models and synthetic corpora.