The Biomass Problem: How AI Is Eating Itself
Site Owner
Published 2026-05-27
As AI generates over 500 billion images and tens of trillions of words annually, a quiet crisis is emerging: model collapse. Trained on their own outputs across multiple generations, AI systems risk becoming progressively narrower and less able to represent the diversity of real-world data. This article explores the biomass problem: why it happens, why benchmarks miss it, and why the math is relentless.

In 2026, AI systems generated over 500 billion images and tens of trillions of words. By 2027, estimates suggest humans will produce less than 5% of the data that AI models train on. These numbers sound like industry trivia. They actually describe a slow-motion catastrophe hiding in plain sight.
Welcome to model collapse — the feedback loop that could make future AI systems progressively worse even as they appear to improve.
A Brief History of Eating Your Own Tail
The concept isn't new. In 2024, researchers at Oxford, Cambridge, and collaborating institutions published a seminal paper in Nature showing what happens when a model repeatedly trains on data that it, or similar models, generated. The result wasn't dramatic: no error messages, no obvious failures. Instead, the model slowly lost the ability to represent the tails of the real distribution. Rare events (the unusual bird species, the unconventional sentence structure, the edge case in medical imaging) gradually disappeared from the model's world.
The model didn't get dumber in any way that standard benchmarks would catch. It got narrower. More homogeneous. Less capable of handling anything that didn't look like the average of what it had already seen.
This is the biomass problem. AI is digesting its own outputs and calling it nutrition.
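A toy simulation makes the tail loss concrete. This is a minimal sketch, not the method from the paper: it assumes a hypothetical dataset of 5 common and 50 rare categories, and each "generation" simply resamples from the empirical distribution of the previous one. The category names, counts, and the `simulate` helper are all illustrative assumptions.

```python
import random
from collections import Counter


def next_generation(counts, sample_size, rng):
    """Treat `counts` as the fitted distribution and sample a new dataset from it."""
    categories = list(counts)
    weights = [counts[c] for c in categories]
    samples = rng.choices(categories, weights=weights, k=sample_size)
    # Counter only keeps categories that were actually drawn at least once.
    return Counter(samples)


def simulate(generations=10, sample_size=500, seed=0):
    rng = random.Random(seed)
    # Hypothetical "real" data: 5 common categories plus 50 rare ones.
    real = Counter({f"common_{i}": 90 for i in range(5)})
    real.update({f"rare_{i}": 1 for i in range(50)})
    history = [real]
    for _ in range(generations):
        history.append(next_generation(history[-1], sample_size, rng))
    return history


history = simulate()
# Number of distinct categories still represented in each generation.
surviving = [len(h) for h in history]
print(surviving)
```

A category that draws zero samples in one generation can never reappear in a later one, so the count of surviving categories can only stay flat or fall, and the rare categories are the first to go. The common categories keep looking healthy throughout, which is why an average-case check would not flag the loss.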
Why Nobody Noticed
Model collapse is hard to see from inside the lab. Here's why:
Benchmarks reward average performance. MMLU, HumanEval, MATH: these tests measure how well models handle common, well-represented tasks. A model suffering from mild collapse can score as well as, or even higher than, a healthier competitor on these benchmarks simply because it concentrates more probability on mainstream patterns. Confidence on common cases is not the same thing as correctness across the whole distribution.
Synthetic data works — until it doesn't. For many practical applications, training on AI-generated data is genuinely fine. Code completion, email drafting, standard documentation — these domains are saturated with examples. Adding more of the same quality doesn't hurt much. The problem emerges at the edges: rare diseases in medical data, unusual syntax in low-resource languages, edge cases in legal reasoning. These are exactly the areas where data scarcity already exists, and where synthetic data is most likely to be used as a band-aid.
The collapse is generational. If you're training a single model on a fixed dataset, you can avoid the problem by curating carefully. The crisis emerges across multiple generations of models, each trained partly on the previous one's output. First-generation models trained on real human data produce outputs. Second-generation models train partly on those outputs. Third-generation models go further. Within a handful of generations, the original signal can degrade beyond recognition.
The Numbers That Should Worry You
Consider the training data pipeline for the average mid-tier AI company in 2026:
- Stage 1: Scrape the open web (mostly human-generated, pre-2023)
- Stage 2: Add licensed data from publishers and data brokers
- Stage 3: Augment with synthetic data generated by larger models
- Stage 4: Repeat annually
Stage 3 is growing. Stages 1 and 2 are shrinking, not because the data is gone, but because access to it is closing off and using it is increasingly seen as legally and commercially toxic. Social media platforms are actively blocking scrapers. Publishers have learned what it means when a tech company "licenses" their content. The human-written portion of the internet that's available for training is shrinking relative to what's being generated.
The math is relentless. At some point, the average quality of training data will decline. Not because AI outputs are inherently lower quality — some are genuinely excellent — but because the diversity of perspectives, writing styles, and edge cases that makes real data valuable will be progressively diluted.
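The dilution can be put into rough arithmetic. If the human pool H grows by a rate g_h per year and the synthetic pool S grows by g_s, the human share after t years is H(1+g_h)^t / (H(1+g_h)^t + S(1+g_s)^t), which heads toward zero whenever g_s is larger than g_h. The sketch below works this through; the starting volumes and growth rates are hypothetical placeholders chosen only to show the shape of the curve, not measurements.

```python
def human_share(human, synthetic, human_growth, synthetic_growth, years):
    """Return the human fraction of the training pool for year 0 through `years`."""
    shares = []
    for _ in range(years + 1):
        shares.append(human / (human + synthetic))
        # Compound each pool by its own annual growth rate.
        human *= 1 + human_growth
        synthetic *= 1 + synthetic_growth
    return shares


# Hypothetical inputs: an 80/20 human/synthetic split today, with human data
# growing 2% a year and synthetic data growing 50% a year.
shares = human_share(
    human=80.0,
    synthetic=20.0,
    human_growth=0.02,
    synthetic_growth=0.50,
    years=10,
)
for year, share in enumerate(shares):
    print(f"year {year}: {share:.1%} human")
```

The exact numbers matter less than the structure: as long as synthetic output compounds faster than human output, the human share falls every year, no matter how favorable the starting split.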
The Hidden Cost Nobody Prices
There's a second-order effect that's even more insidious. Human-generated data has a crucial property that synthetic data lacks: it's grounded in physical reality. Humans write about what they see, hear, experience, and build. Their text describes real objects in real spaces doing real things. Even fiction — especially good fiction — maps onto human experience in ways that remain meaningful across generations.
AI-generated text describes what other AI-generated text described. Each iteration drifts further from the sensory grounding that gives language its meaning. A model describing a "cozy café" is describing a cluster of tokens that correlates with what other models described when they described cafés. Nobody's actually drinking the coffee.
This isn't philosophy. It's engineering. Grounded, diverse, human-generated data is infrastructure — like clean water in a city. You don't notice it until it's gone.
What's Actually Being Done
The research community is not idle. Several promising directions are emerging:
Deduplication at scale — Better tools for identifying and removing AI-generated content from training corpora. This is harder than it sounds — the signals are subtle, and the cat-and-mouse game with watermarking schemes is ongoing.
Provenance tracking — Systems like C2PA embed cryptographic attestations in AI-generated media, creating a trail that can be traced back to origin. If widely adopted, this could let training pipelines preferentially weight human-originated data.
Deliberate data curation — Frontier labs are increasingly treating data quality as a competitive moat, investing heavily in human annotation, domain expert review, and structured data collection rather than passive scraping. The economics of this are brutal at scale, but for the highest-value models, it's the only viable path.
Model collapse detection — New benchmarks specifically designed to measure distributional shift and tail-representation rather than average-case performance. These are early but promising.
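To illustrate what a tail-representation measurement might look like, here is a minimal sketch of one possible metric: the fraction of rare categories in a reference dataset that still show up in a model's samples. The function name `tail_coverage`, the 1% rarity threshold, and the bird labels are assumptions made for this example, not an established benchmark.

```python
from collections import Counter


def tail_coverage(reference, generated, rare_threshold=0.01):
    """Fraction of rare reference categories that appear at least once in `generated`."""
    ref_counts = Counter(reference)
    total = sum(ref_counts.values())
    # A category is "rare" if it makes up at most `rare_threshold` of the reference.
    rare = {c for c, n in ref_counts.items() if n / total <= rare_threshold}
    if not rare:
        return 1.0  # Nothing rare to lose.
    seen = set(generated)
    return len(rare & seen) / len(rare)


# Hypothetical label sets: one dominant class and three rare ones.
reference = ["sparrow"] * 97 + ["kakapo", "shoebill", "hoatzin"]
healthy = ["sparrow"] * 98 + ["kakapo", "shoebill"]
collapsed = ["sparrow"] * 100

print(tail_coverage(reference, healthy))
print(tail_coverage(reference, collapsed))
```

Both sample sets agree with the reference on the dominant class almost all of the time, so an average-case score would rate them about the same. Only the tail metric separates the set that still produces rare species from the one that has lost them entirely.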
The Uncomfortable Truth
None of these solutions scale to the full problem. The internet is vast, but it's not infinite, and the fraction of it that's both human-generated and publicly accessible is shrinking relative to AI-generated content. Every AI coding assistant that drafts a Stack Overflow answer, every AI writing tool that produces a blog post, every AI image generator that floods a stock site — all of it becomes substrate for the next generation of training.
The most honest framing: the AI industry has built an extraordinarily powerful technology on a one-time resource — the accumulated output of human civilization encoded in text and images across the internet. That resource is being consumed. What's being produced in its place is more voluminous but less rich.
The biomass problem isn't a bug. It's the natural consequence of building intelligence on a finite substrate and then optimizing for scale. The fix isn't obvious, and the incentives pushing toward the problem are stronger than the incentives pushing toward the solution.
The question is not whether AI will get better. It will. The question is whether the version of "better" we're building is one that can keep getting better — or one that slowly consumes the diversity that made it capable in the first place.
On the current trajectory, we might find out within a decade.
This article does not represent professional advice, investment guidance, or any prediction with actual accuracy. If you're building a training pipeline, talk to your data scientists. If you're worried about the future of AI capability, the honest answer is that nobody knows.