Sample Selection Bias

Concept page for distributional bias introduced by non-representative sample choice.

Sample Selection Bias occurs when the data chosen for training or evaluation are not representative of the population or target distribution the model is expected to handle. In this wiki the concept is important because selection bias can compound when a model is repeatedly trained on generated or locally filtered data.
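The definition above can be written compactly. The notation here is an assumption introduced for illustration and is not taken from the source: p(x) is the target distribution, s is a binary indicator that equals 1 when an example is selected, and p_sel(x) is the distribution of the selected data.

```latex
% Distribution of the data that survive selection (Bayes' rule)
p_{\mathrm{sel}}(x) \;=\; p(x \mid s = 1) \;=\; \frac{P(s = 1 \mid x)\, p(x)}{P(s = 1)},
\qquad
P(s = 1) \;=\; \int P(s = 1 \mid x')\, p(x')\, dx'
```

Under this notation the selected data match the target, p_sel(x) = p(x), whenever P(s = 1 | x) is the same for every x that has positive probability under p. Sample selection bias corresponds to a selection probability that varies with x.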

Role in this wiki

This page explains a mechanism behind Synthetic Data failures. Selection bias is not merely a label for a bad dataset. It is a process: once a subset is preferred, the under-selected modes receive fewer examples, the model trained on that data generates them less often, and the next round of data may become narrower still. In networked settings, the bias may differ across parties, which makes diagnosis harder.
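The compounding process described above can be illustrated with a toy recursion. This is a minimal sketch under stated assumptions, not a method from any cited paper: the "model" is just a categorical distribution over three modes, the hypothetical `keep_rates` give the chance that a sample from each mode survives selection, and refitting is idealized as exact renormalization with no sampling noise. All function names and numbers are invented for illustration.

```python
import math


def select_and_refit(probs, keep_rates):
    """One round: weight each mode by its keep rate, then renormalize.

    Models the expected outcome of generating data from `probs`, keeping
    each sample with its mode's keep rate, and refitting on what is kept.
    """
    weighted = [p * k for p, k in zip(probs, keep_rates)]
    total = sum(weighted)
    return [w / total for w in weighted]


def entropy(probs):
    """Shannon entropy in nats; lower values mean a narrower distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def simulate(initial, keep_rates, rounds):
    """Return the list of mode distributions, starting with the initial one."""
    history = [list(initial)]
    for _ in range(rounds):
        history.append(select_and_refit(history[-1], keep_rates))
    return history


# Hypothetical setup: mode 0 is favored by selection, mode 2 is disfavored.
initial = [0.5, 0.3, 0.2]
keep_rates = [1.0, 0.8, 0.5]

history = simulate(initial, keep_rates, rounds=10)
for t, probs in enumerate(history):
    print(t, [round(p, 4) for p in probs], round(entropy(probs), 4))
```

In this sketch the probability of the disfavored mode shrinks geometrically from round to round and the entropy falls with it, even though the selection rule itself never changes. A finite-sample version would add noise on top of this trend and could lose a rare mode entirely once it draws zero examples.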

Connection to Qiao's work

The ICML 2026 paper "When Sample Selection Bias Precipitates Model Collapse" places this concept in its title. The paper studies how local selection behavior can precipitate collapse in recursive synthetic-data training and how collaborative signals can diagnose the resulting distributional drift. This page is therefore one of the most direct background entries for Qiao's synthetic-data line and one of the bridges to AI and networks.

See also