EKOAHAM Research · Vol. 7, Issue 2
Research Article

Iterative Degradation of Large Language Models Under Synthetic Data Feedback: A Systematic Review of Model Collapse Mechanisms and Mitigation Approaches

Executive Summary

AI-generated data in training sets can induce gradual “degradation” of model performance. In this report we define degradation in terms of measurable metrics (perplexity, accuracy on benchmarks, and output diversity) and survey evidence that self-generated data can degrade AI quality. Multiple studies (and a recent Nature article) document a phenomenon called model collapse: iteratively training on model outputs causes models to lose information about rare cases and produce narrower, more repetitive outputs. This manifests as rising perplexity, lower accuracy on complex tasks, and reduced linguistic diversity. For example, Shumailov et al. find that an OPT-125M model fine-tuned on its own generated text suffers perplexity increases of ~20–28 points versus training on true data, and the “tails” of the data distribution disappear over generations. Similarly, Guo et al. report consistent declines in lexical/syntactic/semantic diversity across generations of LLM fine-tuning on synthetic text. These effects are not limited to text: Hataya et al. show synthetic images can degrade image classification accuracy (attributing it to fewer modes in synthetic data), and Bohacek & Farid find GANs retrained on even small amounts of their own outputs produce highly distorted images.

The mechanisms include data contamination (new models ingesting prior models’ outputs), distribution shift (the training distribution drifting away from the true data distribution), and reinforcement of errors (mistakes or biases compounding across iterations). Static evaluation benchmarks are affected by a related problem: test sets that leak into training corpora through shared web content can inflate measured scores. Empirical simulations confirm these effects: for example, iterative “telephone games” with LLMs reveal gradual distortion of content and amplification of certain biases.

To mitigate these risks, researchers propose rigorous data curation (filtering or limiting AI-generated content, preserving original human data), human-in-the-loop validation (to catch biases and unrealistic patterns), dynamic evaluation protocols (continually updated test sets like LatestEval), and continual learning strategies (always mixing fresh real data). IBM and others emphasize tracking data provenance, retaining diverse human-sourced data, and combining synthetic with real data. Ethically, unchecked degradation could narrow knowledge (losing “long-tail” ideas), amplify biases, and erode trust in AI. In sum, the evidence suggests caution: widespread use of AI-generated data in model training can make models “dumber” over time unless careful safeguards are implemented.


1. Definitions and Metrics of Degradation

Degradation of a language model is typically observed as declining performance on tasks or less rich outputs. We define it via metrics like:

  • Perplexity on held-out text: higher perplexity indicates a worse fit to the language distribution. Shumailov et al. observed that iterative self-training increased OPT-125M’s perplexity by ~20–28 points versus the baseline.
  • Accuracy on benchmarks: e.g. multi-task language understanding (MMLU) or exams. GPT-4 demonstrated large gains over GPT-3.5 on such benchmarks (e.g. 75% vs 50% on a medical licensing exam), so any future declines would be notable. However, if models memorize test content (data contamination), static benchmark scores can become unreliable.
  • Diversity measures: lexical/syntactic/semantic diversity of outputs (e.g. type-token ratio, Shannon entropy). Guo et al. introduce such metrics and find they steadily decline when an LLM is fine-tuned iteratively on its own generations.
  • Qualitative behavior: anecdotal measures such as hallucination rates or human ratings of coherence. IBM notes that collapsed models tend to produce “irrelevant, nonsensical and repetitive” text.

In short, degradation can be quantified by rising uncertainty (perplexity), falling task accuracy, and shrinking output diversity. Importantly, the term “model collapse” has been used to describe this iterative decline when training on model-generated data.
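
As a rough illustration of the metrics listed above, the following Python sketch computes perplexity from per-token log-probabilities, a type-token ratio, and the Shannon entropy of a unigram distribution. The function names and the whitespace tokenization are assumptions made for this example; they are not taken from Shumailov et al. or Guo et al., whose diversity metrics are more elaborate.

```python
import math
from collections import Counter


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens; lower values mean more repetition."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def shannon_entropy(tokens):
    """Entropy in bits of the empirical unigram distribution of the tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two toy outputs: one varied, one repetitive (whitespace tokenization is a simplification).
varied = "the quick brown fox jumps over the lazy dog".split()
repetitive = "the cat the cat the cat the cat the".split()

print(type_token_ratio(varied), type_token_ratio(repetitive))
print(shannon_entropy(varied), shannon_entropy(repetitive))
```

A model that assigns probability 0.25 to every token has perplexity 4 under this definition, and the repetitive sample scores lower than the varied one on both diversity measures, which is the direction of change the studies above associate with collapse.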


2. Evidence from Research and Benchmarks

Academic studies have directly investigated self-training effects:

  • Shumailov et al. (Nature 2024) – This key study introduces “model collapse,” showing that recursively training GPT-like models on their own outputs causes irreversible defects. An OPT-125M model trained generation-by-generation on synthetic Wikipedia text lost nearly all rare content: its output distribution’s tails disappeared and perplexity rose. When even 10% of original data was preserved per iteration, degradation was much smaller, highlighting the importance of human data. The authors warn that large-scale use of LLMs will “pollute” the corpus for future models, making genuine human data ever more valuable.
  • Guo et al. (NAACL 2024) – Studied linguistic diversity under recursive synthetic fine-tuning. They report a consistent decline in lexical/syntactic/semantic diversity of model outputs over successive iterations. This effect was especially pronounced for creative text tasks. Their work underscores that training on synthetic text risks eroding a model’s expressivity and richness.
  • Kovač et al. (INRIA 2025) – Empirically examine how feedback loops depend on data properties. They confirm that the severity of distribution shift varies by dataset and synthetic-data ratio. Importantly, they find lexical diversity amplifies degradation, while semantic diversity and overall data quality mitigate it. In practice this suggests that rich, varied human text can buffer against collapse. Their iterative “chain” simulations mirror real-world web data mixing and show distribution shift (misrepresenting true data) setting in as synthetic data dominates.
  • Kazdan et al. (Stanford 2024) – Analyze two scenarios: “replace” (each new model trained only on its predecessor’s outputs) vs “accumulate” (all real+synthetic data are kept). They find that collapse (growing test loss on real data) invariably appears in the replace scenario, but is largely avoided when real data continually accumulate. Under realistic fixed-compute limits, losses plateau rather than diverge. They also note a non-trivial interplay: synthetic data can even help when real data are scarce, but can hurt when real data are plentiful. This work implies that as long as training always includes sufficient human-generated data, catastrophic decline can be prevented.
  • Hataya et al. (ICCV 2023) – In the vision domain, they “contaminate” ImageNet and COCO with Stable Diffusion-generated images. Models trained on these mixed datasets perform worse on classification and generation tasks than those trained on only real data. They attribute this to synthetic images covering fewer visual modes than real images. Although the domain is images rather than text, the result parallels the text findings: synthetic data tend to be statistically impoverished, harming downstream performance.
  • Bohacek & Farid (ICLR 2025) – Focus on image GANs: retraining even small models on their own synthetic images yields severely degraded, distorted outputs. Crucially, models did not fully recover quality even after retraining on real images. This “nepotistic” collapse shows the durability of the contamination effect, suggesting that once a model is biased by its own data, it may not easily revert.

These controlled experiments establish that iteratively training on AI-generated data can degrade model quality. They provide concrete metrics: rising perplexity, lower accuracy/diversity, and visible output corruption.
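
The loss of distribution tails can be reproduced in a toy setting. The sketch below is not the experimental setup of any study cited above: it assumes a "model" that is just a categorical distribution over token ids, fitted by relative frequency, with a Zipf-like true distribution. All names and parameters (`simulate`, `keep_real`, the vocabulary size) are hypothetical. Each generation is trained on samples drawn from the previous one, optionally mixed with a slice of the original real data.

```python
import random
from collections import Counter


def fit(samples, vocab_size):
    """Maximum-likelihood categorical estimate: relative frequency of each token id."""
    counts = Counter(samples)
    return [counts[i] / len(samples) for i in range(vocab_size)]


def simulate(generations=30, vocab_size=200, n=500, keep_real=0.0, seed=0):
    """Return the number of token types with non-zero probability after each generation."""
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    # Zipf-like "true" distribution: a few frequent types and a long tail.
    weights = [1.0 / (rank + 1) for rank in ids]
    real = rng.choices(ids, weights=weights, k=n)
    model = fit(real, vocab_size)
    support = [sum(p > 0 for p in model)]
    n_real = int(keep_real * n)
    for _ in range(generations):
        synthetic = rng.choices(ids, weights=model, k=n - n_real)
        # The next generation trains on synthetic samples plus an optional slice of real data.
        train = synthetic + rng.sample(real, n_real)
        model = fit(train, vocab_size)
        support.append(sum(p > 0 for p in model))
    return support


replace_only = simulate(keep_real=0.0)  # each generation sees only its predecessor's output
with_real = simulate(keep_real=0.1)     # 10% of each training set is original real data

print("replace only:", replace_only[0], "->", replace_only[-1])
print("with 10% real:", with_real[0], "->", with_real[-1])
```

In the replace-only run the support can never grow: once a rare type draws zero samples, its estimated probability is zero and it cannot be generated again. Mixing real data back in gives rare types a route to re-enter, which is the intuition behind the data-preservation results reported above.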

Benchmarks and Real-World Observations: Official benchmark results (e.g. GPT-4 vs GPT-3.5) show improvement when models incorporate new data and scale. However, these do not rule out subtle declines in conversational quality or creativity. Some reports (e.g. developer forums, social media) suggest users notice ChatGPT responses becoming more formulaic over long sessions, although such anecdotes lack rigorous measurement. Importantly, static benchmarks can be compromised: as models are trained on web text, standard test sets (often from the web) may leak into training, inflating scores. New benchmarks like LatestEval avoid this by sourcing fresh content, finding that state-of-the-art LMs have much higher perplexity on truly novel text. In summary, while public metrics for newer models remain high, the research literature provides consistent evidence that synthetic data loops can degrade underlying model competence over time.

| Study | Domain | Setup | Key Findings | Mitigation |
| --- | --- | --- | --- | --- |
| Shumailov et al. (2024) | NLP (OPT-125M) | Iterative self-training on wiki text | Tails disappear; perplexity rises ~20–28 points; collapse unless original data preserved | Keep human data (10% preserved avoids most loss) |
| Guo et al. (2024) | NLP (various NLG tasks) | Recursive fine-tuning on synthetic outputs | Consistent decline in lexical/syntactic/semantic diversity | Use diverse training sources; human review |
| Kovač et al. (2025) | NLP (Twitter, Reddit) | Vary synthetic-vs-human ratios and data properties | Lexical diversity amplifies collapse; higher semantic diversity/quality reduce it | Ensure high-quality, semantically rich data; preserve domain variety |
| Kazdan et al. (2024) | Theory/LLMs | “Replace” vs “Accumulate” training workflows | Collapse occurs under “replace” (train only on synthetic), but is largely avoided when real data accumulate. Synthetic helps if real data scarce, hurts if ample | Always include fresh real data in training set (accumulation strategy) |
| Hataya et al. (2023) | Vision (ImageNet/COCO) | Replace real images with diffusion-generated images | Downstream accuracy drops on classification; synthetic images have fewer modes | Filter synthetic images; watermark; combine with real data |
| Bohacek & Farid (2023) | Vision (GANs) | Retrain model on its own generated images | Even small amounts of self-generated data cause “highly distorted” outputs; damage persists on retraining | Avoid retraining exclusively on synthetic; mix real data back |
| IBM analysis (2024) | Industry (LLMs) | Summary of literature + examples | “Models trained solely on predecessors’ output produce increasingly inaccurate results”; knowledge of rare events fades (“long-tail ideas” drop) | Retain human data, track provenance; dynamic benchmarking |

3. Mechanisms of Degradation

  • Data Contamination (Feedback Loops): As more AI-generated text enters the web, future LLMs may train on it. Shumailov et al. warn that “the use of LLMs at scale to publish content on the Internet will pollute the collection of data to train their successors.” In effect, a model trained on its predecessor’s output perpetuates any biases or errors. This feedback loop gradually shifts the training distribution away from true human data towards a “hallucinated” version of reality.
  • Distribution Shift: Repeated self-training causes a shift between the true data distribution and what models learn. Kovač et al. call this a mismatch between “true and generated distribution.” Over generations, rare or extreme cases (the “tails”) are underrepresented. Shumailov et al. depict this as models “forget[ting] the true underlying data distribution… losing information… tails disappearing.” Practically, outputs become focused on the most probable sequences seen so far.
  • Reinforcement of Errors: Small mistakes compound. In the example from Shumailov et al., later generations produced outputs (e.g. about “jack rabbits with different-colored tails” for an architecture query) that no human would write. These outliers stem from earlier models’ misperceptions, which later models then learn as the norm. Over time, errors introduced by one model are reinforced in the next, which in turn adds new errors of its own.
  • Model Drift vs. Collapse: This phenomenon differs from classic model drift (due to changing real-world input) or catastrophic forgetting (within one model). Here the drift is self-induced: each model’s outputs feed into the next. IBM distinguishes it from other effects: model collapse occurs over generations as training sets are polluted, whereas catastrophic forgetting is within a single model’s life. The performative prediction analogy is apt: a model’s outputs influence future training data, like a self-fulfilling prophecy.
  • Evaluation Bias: Standard benchmarks can mask degradation. Many test items are drawn from online sources (news, Wikipedia) that may appear in LLM training sets. If an LLM simply memorizes and reproduces answers, it can score artificially high. LatestEval and similar dynamic tests avoid this by using new text. Li et al. show models have no prior knowledge of newly created Wikipedia content, yielding higher perplexity. Thus contaminated older benchmarks become less sensitive to real declines, and stable scores on them may hide degradation rather than rule it out.

In summary, the degradation arises because self-generated data lack human diversity and contain errors, and because these get re-ingested unfiltered. The model’s view of the world narrows and shifts over time.
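
The evaluation-bias mechanism above is often probed with a simple overlap check between test items and the training corpus. The sketch below is one minimal way to do that, assuming whitespace tokenization and an exact word n-gram match; the function names and the choice of `n` are illustrative and do not describe the method of LatestEval or any cited paper.

```python
def ngrams(text, n=8):
    """Set of word-level n-grams in a lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(test_items, training_docs, n=8):
    """Fraction of test items that share at least one n-gram with the training corpus."""
    seen = set()
    for doc in training_docs:
        seen |= ngrams(doc, n)
    flagged = [item for item in test_items if ngrams(item, n) & seen]
    return len(flagged) / len(test_items) if test_items else 0.0


# Toy data, invented for illustration only.
training_docs = ["the capital of france is paris and it lies on the seine"]
test_items = [
    "Question: the capital of France is Paris, true or false?",
    "Name the largest moon of Saturn.",
]

rate = contamination_rate(test_items, training_docs, n=4)
print(f"{rate:.0%} of test items overlap with the training corpus")
```

Exact matching of this kind misses paraphrased leaks, so a low rate is weak evidence of a clean benchmark, while a high rate is a strong warning sign.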


4. Case Studies: GPT and Other LLM Families

  • OpenAI GPT-series: Each official release (GPT-2 to GPT-3.5 to GPT-4) showed metric improvements, but our concern is future retraining. OpenAI has not publicly detailed how much ChatGPT’s user interaction data (some of which is synthetic) feeds into new models. In principle, if ChatGPT answers are scraped and used for retraining, any quirks or errors in those answers could propagate. The GPT-4 technical report does not address this loop; it only compares GPT-4 (Mar 2023) to prior models on benchmarks. Anecdotally, some users reported later ChatGPT models such as GPT-4o (2024) or GPT-4.1 (2025) giving less reliable answers than earlier versions, but this has not been officially quantified.
  • Other LLM families: Meta’s LLaMA, Anthropic’s Claude, Google’s Gemini, and others (open-weight and proprietary alike) likely face similar risks if trained on scraped web text. For example, an experiment fine-tuning Meta’s OPT-125M with its own outputs (cited by IBM) produced bizarre replies (the “jack rabbit” example). Although detailed performance timelines (like those for GPT) are scarce, we can anticipate that any model family relying on large web crawls will need to guard against synthetic feedback.
  • Benchmark Performance Trends: GPT-3 (2020) scored modestly on multi-task exams; GPT-4 achieved human-level or near-human scores (e.g. >85% on MMLU, far above GPT-3.5). No official data shows these scores eroding. However, continuous evaluation is tricky: static benchmarks are leaked over time. Recent leaderboards note newer models like GPT-4o and Google Gemini outperform older ones, but again those comparisons assume fresh training data. In practice, no public model has yet “officially” gotten worse in benchmarks over time. The concerns are more predictive: if AI-generated content dominates the web, future models (e.g. hypothetical GPT-5+) might plateau or degrade without intervention.
  • Corporate Reports: Industry analyses (IBM Think, Forbes, etc.) corroborate academic findings. IBM’s blog clearly defines model collapse for LLMs and cites the OPT-125M study. It warns that common AI failures (hallucinations) could worsen as rare facts disappear. While not a “case study” per se, these reports reflect the consensus that generative loops pose a real risk.

Overall, the GPT and LLM case studies illustrate potential issues, but the real evidence comes from controlled experiments (see Section 2). Actual products may hide internal improvements that offset such loops. The timeline of LLM development is shown below:

Timeline of Key LLM Releases and Findings

  • 2019: GPT-2 released (first large open LM)
  • 2020: GPT-3 (175B) sets new benchmarks
  • 2022: ChatGPT (Nov, GPT-3.5-based) gains wide usage
  • 2023: GPT-4 (Mar) outperforms GPT-3.5 on complex tasks
  • 2024: Studies highlight risks of AI-data feedback (e.g. Shumailov 2024, Guo 2024)
  • 2025+: Ongoing research on synthetic data effects and mitigation strategies

5. Mitigation Strategies

To prevent or slow degradation, several approaches have been proposed:

  • Data Curation & Provenance: Carefully filter training data to limit AI-generated content. Track the source of each document; prefer human-authored sources. IBM recommends “tracking data provenance” and preserving original human-generated data. Shumailov et al. show that retaining even 10% of the original data eliminated most performance loss. In practice, large-scale corpus builders (Common Crawl, Wikipedia dumps, etc.) could tag or watermark AI text to exclude it from future corpora.
  • Mixing Synthetic with Real Data: Always combine synthetic outputs with fresh real data (“accumulate” strategy). Kazdan et al. find that when real data accumulate over time, collapse is largely avoided. If synthetic data must be used (e.g. to augment scarce data), it should be blended with verified human examples. Maintaining a healthy ratio of human-written content can act as a stabilizer.
  • Human-in-the-Loop (HITL): Incorporate human oversight at various stages. For example, use human reviewers to validate AI-generated training examples or to correct model outputs used for further training. Humans can catch subtle biases or factual errors that synthetic data amplifies. As one industry overview notes, synthetic data often lacks real-world nuance, so “without human oversight, the model has no way to learn these real-world exceptions.” Organizations should have pipelines where annotators spot-check or refine synthetic data before it enters the training set.
  • Filtering and Watermarking: Develop techniques to detect AI-generated text. OpenAI and others have researched watermarks in model outputs. Filtering out content flagged as AI-generated before re-training can break feedback loops. Similarly, some image generators now watermark synthetic images; analogous methods for text (e.g. statistical watermarks embedded in token choices) are being explored. These tools aren’t foolproof yet, but they offer a way to exclude much of the synthetic noise.
  • Continual Learning & Curriculum: Instead of one-shot pretraining, use continual updates with new data. Continual learning methods (e.g. fine-tuning on new real-world data) can adapt a model gradually and prevent sudden distribution drift. Curriculum strategies might prioritize rarer content first. In practice, model developers can schedule regular retraining on curated human data to offset any synthetic overrepresentation.
  • Dynamic Evaluation Protocols: Use continually updated benchmarks to avoid contamination. The LatestEval pipeline exemplifies this: by generating test questions from fresh sources (ArXiv, news, GitHub) beyond any model’s training time, it ensures models can’t “cheat” by memorization. Such rolling evaluations force models to truly generalize. Companies and researchers should also monitor performance on out-of-distribution tasks to catch hidden degradation.
  • Governance and Auditing: Implement governance tools to audit training corpora and model outputs. The IBM AI Academy suggests “data governance” and auditing for collapse. Industry-wide, there could be standards (or regulations) requiring disclosure of data sources. Open repositories of training data (with provenance metadata) would allow scrutiny and selective curation.
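
The provenance and mixing items in the list above can be combined into one small data-preparation step. The sketch below assumes every document already carries a trustworthy provenance tag, which in practice is the hard part; the `Doc` type, its `source` field, and `build_training_mix` are hypothetical names introduced only for this illustration. It keeps all human documents and subsamples synthetic ones so that they never exceed a chosen share of the mix.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    text: str
    source: str  # hypothetical provenance tag: "human" or "synthetic"


def build_training_mix(docs, max_synthetic_fraction=0.3, seed=0):
    """Keep all human documents and subsample synthetic ones to respect the cap."""
    if not 0.0 <= max_synthetic_fraction < 1.0:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    human = [d for d in docs if d.source == "human"]
    synthetic = [d for d in docs if d.source == "synthetic"]
    # Solve s / (h + s) <= f for s, the number of synthetic documents allowed.
    limit = int(max_synthetic_fraction * len(human) / (1.0 - max_synthetic_fraction))
    rng = random.Random(seed)
    kept = rng.sample(synthetic, min(limit, len(synthetic)))
    mix = human + kept
    rng.shuffle(mix)
    return mix


# Toy corpus: 6 human and 5 synthetic documents, capped at 25% synthetic.
docs = [Doc(f"h{i}", "human") for i in range(6)]
docs += [Doc(f"s{i}", "synthetic") for i in range(5)]
mix = build_training_mix(docs, max_synthetic_fraction=0.25)
n_synthetic = sum(d.source == "synthetic" for d in mix)
print(len(mix), n_synthetic)
```

The cap value is a design choice rather than a number taken from the literature; the studies summarized here support keeping real data in the mix but do not prescribe one universal ratio.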

These strategies aim to break the feedback loop. For instance, IBM’s recommendations include retaining non-AI data sources and leveraging data accumulation, which echo the accumulate findings of Kazdan. In essence, ensuring access to clean human data and carefully mixing synthetic content can substantially mitigate quality loss.
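
The dynamic-evaluation idea can also be reduced to a simple rule: only test on material published after the model's training cutoff. The sketch below shows that rule in isolation. It is not the LatestEval pipeline, and the record layout (`id`, `published`) and the function name are assumptions made for the example.

```python
from datetime import date


def fresh_test_pool(docs, training_cutoff):
    """Keep only documents published strictly after the model's training cutoff."""
    return [d for d in docs if d["published"] > training_cutoff]


# Invented candidate documents with publication dates.
candidates = [
    {"id": "a", "published": date(2023, 1, 10)},
    {"id": "b", "published": date(2024, 6, 2)},
    {"id": "c", "published": date(2024, 9, 15)},
]

pool = fresh_test_pool(candidates, training_cutoff=date(2024, 1, 1))
print([d["id"] for d in pool])
```

A date filter only helps when publication dates are reliable and the cutoff is known, so it complements overlap checks and does not replace them.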


6. Ethical and Societal Impacts

If unchecked, LLM degradation has broad implications. Models that gradually lose nuance risk entrenching biases and shrinking the diversity of ideas. IBM warns that collapsed AI systems may omit rare but important information – “long-tail” ideas could “fade out of the public’s consciousness.” For example, an AI search tool might stop retrieving obscure but correct facts, reinforcing echo chambers.

Practical consequences include poor decision-making: an AI medical assistant might stop considering rare diagnoses, or an AI lawyer might only cite popular case law. IBM gives the scenario of an AI failing to diagnose a rare disease because it “forgot” that condition from its (skewed) training data. Similarly, user experience can suffer: recommendation systems might only suggest blockbuster items, frustrating users with niche tastes. Over time, knowledge decline in AI could erode public trust in these systems.

Ethically, there is a fairness dimension: if synthetic data loops reflect dominant biases, marginalized viewpoints may vanish from model outputs. As one source notes, synthetic data tends to “double down” on any bias present in its original training data. Without intervention, these biases could propagate.

Finally, there is a cultural impact: generative AI now produces a growing share of the new content we read (news articles, social media, etc.). If this content becomes increasingly synthetic and self-referential, the collective knowledge base that future models learn from is impoverished. This could slow progress and innovation, as novel or rare ideas are underrepresented.

In summary, the potential harm of LLM degradation includes loss of knowledge diversity, reinforcement of bias, mistakes in AI-assisted decisions, and erosion of trust. These stakes underscore why mitigating synthetic feedback loops is not just a technical issue but a societal one.


7. Conclusion

A convergence of research points to one clear insight: reliance on unfiltered AI-generated data for model training tends to degrade model quality over iterations. This “dumbing down” happens as models’ errors and biases are fed back into the training set, narrowing the learned distribution. Empirical studies across domains (text and vision) consistently show that synthetic self-training raises perplexity, reduces diversity and accuracy, and yields progressively worse outputs.

However, model collapse is not inevitable if handled properly. Studies by Kazdan et al. and IBM emphasize that including fresh human data can halt or reverse the decline. The key is awareness: organizations must recognize that AI-driven data augmentation is a double-edged sword. By adopting careful data curation, human oversight, and robust evaluation, we can harness synthetic data’s benefits without poisoning our future models.

Going forward, model developers should treat “AI training data” as a risk factor. The “tail” events – rare facts, creative expressions, minority voices – must be actively preserved, since models themselves may not spontaneously regenerate them. In doing so, we protect not just model performance but the diversity and richness of knowledge that AI can offer humanity.