Recursive Synthetic Data Training
Concept page for training models repeatedly on data generated by earlier models.
Recursive Synthetic Data Training is a training process in which data generated by one model generation becomes part of the training set for a later generation. The loop can be intentional, as in self-training or synthetic-data bootstrapping, or incidental, as generated content leaks into future training corpora.[1]
Role in this wiki
This page explains the process behind model collapse. It is distinct from synthetic data in general: a one-time synthetic augmentation may be useful, while repeated reuse can amplify distributional errors. The wiki uses this page to separate the mechanism from the outcome. Recursive training is the loop; collapse is one possible degenerative result.
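The amplifying loop can be demonstrated with a toy model: fit a Gaussian, sample from the fit, refit on those samples, and repeat. This is an illustrative sketch (the function names and parameters are assumptions for this page, not from any cited paper); the fitted spread tends to shrink across generations because each fit is trained only on its predecessor's output and estimation error compounds.

```python
import random
import statistics

def fit_gaussian(samples):
    # Maximum-likelihood fit: sample mean and population standard deviation.
    return statistics.fmean(samples), statistics.pstdev(samples)

def recursive_train(n_generations=30, n_samples=100, seed=0):
    """Train generation t+1 only on data sampled from generation t."""
    rng = random.Random(seed)
    # Generation 0 is fit on "real" data from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    mu, sigma = fit_gaussian(data)
    sigmas = [sigma]
    for _ in range(n_generations):
        # Each later generation sees only synthetic samples from its predecessor.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu, sigma = fit_gaussian(data)
        sigmas.append(sigma)
    return sigmas

sigmas = recursive_train()
print(f"gen 0 sigma={sigmas[0]:.3f}, gen {len(sigmas) - 1} sigma={sigmas[-1]:.3f}")
```

A one-time synthetic augmentation corresponds to running this loop once; the degenerative behavior only appears when the loop is iterated.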
Connection to Qiao's work
The paper "When Sample Selection Bias Precipitates Model Collapse" studies recursive training under local sample-selection bias. Its setting is especially relevant to AI and networks because the data process is distributed: parties may see different data, select different samples, and share only limited signals. Recursive synthetic-data training therefore becomes a cross-silo reliability problem, not only a generative-model problem.
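To illustrate how local selection can steer the loop (a toy sketch under assumed conditions, not the paper's actual model: the Gaussian setting, party count, and "keep the largest samples" rule are all assumptions), suppose several parties each filter their draws before pooling. The pooled, selected data then trains the next generation, so the selection bias compounds:

```python
import random
import statistics

def biased_generation(mu, sigma, parties=4, per_party=100, keep_frac=0.5, rng=None):
    """One recursive step with a local selection rule at each party.

    Each party draws from the current model but keeps only its largest
    keep_frac fraction of samples; the pooled survivors fit the next model.
    """
    rng = rng or random.Random()
    pooled = []
    for _ in range(parties):
        local = sorted(rng.gauss(mu, sigma) for _ in range(per_party))
        keep = int(per_party * keep_frac)
        pooled.extend(local[-keep:])  # locally biased selection
    return statistics.fmean(pooled), statistics.pstdev(pooled)

def run(generations=10, seed=0):
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    trajectory = [(mu, sigma)]
    for _ in range(generations):
        mu, sigma = biased_generation(mu, sigma, rng=rng)
        trajectory.append((mu, sigma))
    return trajectory

traj = run()
print(f"mean drift: {traj[0][0]:.2f} -> {traj[-1][0]:.2f}, "
      f"spread: {traj[0][1]:.2f} -> {traj[-1][1]:.2f}")
```

In this sketch the fitted mean drifts toward the selected tail while the spread collapses, even though no single party's filter looks drastic in isolation.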
See also
Footnotes
1. The 2024 Nature paper "AI models collapse when trained on recursively generated data" helped popularize this recursive framing for generative models and synthetic corpora.