AI Bias Compounds: The Recursion Deadline

The Recursion Deadline: Why the world needs new datasets on women — now, before the next training generation

12 June 2026

Submitted by Women at the Table | A+ Alliance for Inclusive Algorithms | AI & Equality Human Rights Toolbox

Part 1 of three companion papers. Part 1 (this brief) makes the scientific case for why corrective data on women must enter the foundational layer now. Part 2 (“The data gap AI can still close”) applies it to the open Swiss model Apertus as a concrete, near-term opportunity. Part 3 (“Who collects the record”) sets out who builds the data and how.

The problem, stated precisely

Women are systematically under-documented in the data from which foundation models learn. This is not one gap but many compounding ones: the digital divide that keeps hundreds of millions of women offline and therefore textually invisible; medical research that for decades excluded female subjects, leaving clinical corpora that describe a male-default body; economic statistics that do not record unpaid and informal work, where women’s labor is concentrated; content moderation and “quality filtering” pipelines that disproportionately strip out women’s speech, health discussion, and non-English text; and historical archives that recorded men’s words, deeds, and bodies at vastly higher rates. Foundation models trained on this record do not merely reflect these absences — they operationalize them, at scale, in clinical triage, credit scoring, social protection, hiring, and judicial contexts. (A note on terminology: where the deficit concerns the body — physiology, clinical data, trial exclusion — the operative variable is sex; where it concerns roles, labor, and participation, it is gender. Both are in play, and this brief uses each where it applies.)

One point bears emphasis before any discussion of subgroups: the medical data gap is universal before it is intersectional. Women as such, all women, were excluded from most clinical trials as a matter of policy for decades; the result is a clinical record in which drug dosing, adverse-effect profiles, cardiac symptom presentation, and pain are documented against a male default. No woman sits at the well-documented center of medical data. From that already-displaced baseline, distance compounds along every further axis: language, geography, income, age, disability. A health model is least reliable exactly where stakes are highest, and recursive training, by eroding the tails first, ensures the women furthest from the center are erased fastest. This is not a problem affecting a minority. It is a problem affecting half of humanity, with a gradient. The downstream harm is documented: clinical AI trained on sex-misrepresentative data has reduced diagnostic accuracy for women by 11.3 percentage points, and language models return different clinical assessments from identical notes depending on the patient sex label (Invisible by Design, 2025).

Until recently this could be framed as a static representation problem: regrettable, but correctable over time as more data accumulates. The scientific evidence of the last two years shows that framing is wrong. The problem is dynamic, and it is compounding.

A note on why this agenda centers women specifically, rather than any single under-documented group: leverage. Women are not one minority among many — they are half of every other under-represented category (half the rural poor, half the non-Anglophone, half the disabled, half of every population a deployed model serves). Correcting the gender deficit is therefore the highest-leverage single intervention available, and the methodology developed to do it — audit the corpus, map the gap, source and weight against it — transfers directly to every other axis of exclusion. This is a general method, demonstrated on the case where it returns the most.

The mechanism: recursive learning erases the tails first

A growing body of peer-reviewed work — beginning with Shumailov et al. (Nature, 2024) — demonstrates that when models are trained on data generated by earlier models, degradation follows a predictable sequence: the tails of the data distribution disappear first. Early-stage collapse does not produce gibberish; it produces fluent outputs in which rare, minority, and underrepresented patterns have quietly vanished, while errors and biases present in synthetic data are amplified across generations (Wyllie et al., 2024; Dohmatob et al., 2024).

We state the strength of this claim precisely, because it is where the agenda must be defensible. The collapse dynamics are demonstrated most cleanly under pure or near-pure recursion; production pipelines still mix fresh human data with synthetic, and leading labs filter synthetic content specifically to slow the effect. The rate of degradation is therefore not yet established, and we do not assert a fixed deadline. What the literature does establish, and what is sufficient for the argument, is the direction and the asymmetry: recursion thins sparse regions rather than filling them, and the remedy the same body of work identifies — continued infusion of fresh, real, provenanced human data (Gerstgrasser et al., 2024; Kazdan et al., 2024) — works only where that data exists. The research priority is to measure the rate, not to wait for it.

Apply this finding to the gender data gap and the implication is direct. Wherever women’s data is already sparse — non-Western women’s health, women’s economic activity in informal sectors, women’s voices in low-resource languages, intersections of gender with disability, age, or rurality — that data sits in the statistical tail. Each generation of models trained partly on the synthetic output of its predecessors thins that tail further. The model’s confident, fluent description of “women” converges on the well-documented subset: Anglophone, urban, formal-economy, described through a male clinical default.

We describe this as the half-life of a biased model: representational gaps are not diluted away across training generations. Instead, the representation of sparsely documented groups decays toward zero, and the decay is invisible because output fluency is preserved. A growing share of the public web is now machine-generated. Every training run that ingests it without correction shortens the half-life.
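
The asymmetry described in this section can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions, not a reproduction of any cited experiment: the "model" is reduced to the empirical frequency of one rare pattern in its training set, each generation is trained on a sample drawn partly from the previous model's output and partly from fresh real data, and the function name `simulate_rare_share`, the sample sizes, and the 30% fresh-data fraction are all hypothetical choices made for illustration.

```python
import random

def simulate_rare_share(true_rare_share, sample_size, generations,
                        fresh_fraction=0.0, seed=0):
    """Track the share of a rare pattern across training generations.

    Toy assumption: a "model" is only the rare-pattern frequency of its
    training set, and it generates data by sampling at that frequency.
    """
    rng = random.Random(seed)
    n_fresh = round(fresh_fraction * sample_size)  # real human records per generation
    n_synth = sample_size - n_fresh                # records produced by the previous model
    share = true_rare_share                        # starting point: the real-world share
    history = [share]
    for _ in range(generations):
        # Fresh records carry the rare pattern at its true rate.
        fresh_hits = sum(rng.random() < true_rare_share for _ in range(n_fresh))
        # Synthetic records carry it only at the rate the last model learned.
        synth_hits = sum(rng.random() < share for _ in range(n_synth))
        share = (fresh_hits + synth_hits) / sample_size
        history.append(share)
    return history

# Hypothetical settings: a pattern present in 2% of real records.
pure = simulate_rare_share(0.02, 200, 300, fresh_fraction=0.0, seed=1)
mixed = simulate_rare_share(0.02, 200, 300, fresh_fraction=0.3, seed=1)
print("pure recursion, final rare share:", pure[-1])
print("30% fresh data, final rare share:", mixed[-1])
```

In the pure-recursion setting a share of zero is absorbing: once the rare pattern is missing from one generation's output, no later generation can recover it. With a fresh-data fraction above zero the pattern can re-enter, which is the sense in which the remedy works only where real data exists. The sketch says nothing about the rate of decay in production pipelines, which the text above notes is not yet established.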

Why the window is closing

The same literature points to the remedy — and to its deadline. Collapse is preventable when models retain access to fresh, real, provenanced human data alongside synthetic data (Gerstgrasser et al., 2024; Kazdan et al., 2024). Accumulating genuine data prevents the degradation that pure recursion guarantees. But this remedy only works if the genuine data exists, is documented as genuine, and covers the populations at risk of erasure. For women — and above all for women at the intersections of geography, language, and income — that data largely does not yet exist in machine-usable form. Once successive training generations have been built atop the thinned record, correction becomes a re-foundation problem rather than a data problem: exponentially more expensive, and possibly infeasible for the labs with the strongest commercial incentives to keep building forward.

The concrete is setting. There is a short period in which deliberately constructed datasets can still enter the foundational layer of the model ecosystem. After it, bias is no longer a property of models. It is a property of the substrate every model is built on.

A note on causation, stated plainly

The agenda rests on a causal chain whose first link is firmly evidenced and whose second link is partly evidenced and partly what we propose to measure. That the training record under-documents women is established (the clinical and economic evidence is summarized in our companion Swiss brief and its sources). That a model trained on this record produces materially worse outcomes for women is now documented at the level of outputs — clinical AI reducing diagnostic accuracy for women by 11.3 percentage points, language models returning different clinical assessments from identical notes by patient sex label, judicial tools overpredicting women’s recidivism (Women at the Table & FemTechnology, Invisible by Design, 2025; Women at the Table, Gender Bias in Judicial Algorithms, 2026). What is not yet quantified is the composition of the foundational corpus that produces those output disparities — the root beneath the documented symptoms. We do not paper over this gap; we make it the point. Post-training mitigation adjusts what a model says; the documented output disparities show it has not closed the gap, and how far it could is unknown without corpus-level measurement. Until the corpus is audited, the claim that downstream fixes are sufficient rests on no corpus-level evidence — and it is that claim which currently licenses deployment in clinical and social-protection settings. A central deliverable of this agenda is to close that evidentiary gap at the root rather than assume it away at the surface.

What must be built

1. A global public data commons on women — treated as scientific infrastructure, not advocacy. Consented, provenanced, longitudinal datasets covering the documented gaps: women’s health beyond the male clinical default; informal and unpaid economic activity; low-resource-language text and speech; civic and political participation. Built with — not extracted from — the communities described, drawing on existing capacity (Masakhane, AI4Bharat, community data cooperatives) and governed under human-rights frameworks. Data provenance certification is essential: in a web flooded with synthetic content, verified human data about underrepresented populations becomes the scarcest and most valuable training asset in the ecosystem. This reframing matters for funders: these datasets are stranded-asset insurance for every downstream model.

2. Fairness floors: minimum demographic performance thresholds for high-stakes deployment. No clinical, judicial, or social-protection deployment without demonstrated performance above a defined floor, evaluated intersectionally across gender, geography, and language. The panel is uniquely positioned to recommend the scientific basis for such thresholds, which have analogues in every other safety-critical engineering discipline. HumRights-Bench, the first benchmark grounded in international human rights law, offers a starting architecture.

3. Longitudinal measurement of bias decay — beginning with a baseline that does not yet exist. No institution currently measures how representational gaps move across training generations, and more basically, no one has ever measured what share of a frontier training corpus documents female physiology or records women’s economic activity, because no fully open frontier corpus has been audited. That generation-zero number does not exist; its absence is the evidentiary hole beneath every claim that these systems are deployment-ready for women. We propose standing, independent infrastructure that establishes the baseline and then tracks defined demographic capabilities across successive frontier model generations — the equivalent of atmospheric CO₂ monitoring for the information ecosystem. (The newly released open Swiss model Apertus is the first frontier-scale corpus on which the baseline audit is actually feasible; see companion brief.)

4. A rights-grounded foundation model — built by the states already building. Whether bias can be fixed by remediating existing models, or only by training anew from corrected data, is an open scientific question — arguably the most consequential open question in AI equity. It should be answered empirically, and the actors positioned to answer it are not civil society organizations but member states already investing in sovereign AI capacity. Dozens of governments are funding national and regional foundation models right now. Almost none have made representative, provenanced data a design requirement — which means public money is currently reproducing, at sovereign scale, the same thinned substrate. Our plea is simple: at least one sovereign or consortium model should be trained on deliberately corrected data and evaluated head-to-head against fine-tuned frontier models on the populations those models underserve. If fine-tuning suffices, the field learns something vital and cheap. If it does not — as the collapse literature suggests it eventually will not — the world will need the demonstrated alternative before the substrate hardens. For the states involved, this is not philanthropy; it is industrial policy, and a differentiator no frontier lab is pursuing.
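
The fairness-floor idea in item 2 can be made concrete with a short sketch. It assumes evaluation results are available as one record per test case, labeled with gender, region, and language and marked correct or not. The field names, the 0.80 floor, the minimum sample size, and the records themselves are hypothetical and invented for illustration; this is not HumRights-Bench's method or any regulator's threshold.

```python
from collections import defaultdict

def subgroup_accuracy(records, axes=("gender", "region", "language")):
    """Return {subgroup tuple: (accuracy, n)} for every observed intersection."""
    totals = defaultdict(lambda: [0, 0])  # subgroup -> [correct, total]
    for rec in records:
        key = tuple(rec[a] for a in axes)
        totals[key][0] += int(rec["correct"])
        totals[key][1] += 1
    return {k: (c / n, n) for k, (c, n) in totals.items()}

def check_floor(records, floor, min_n=1):
    """List subgroups that fall below the floor or lack enough evaluation data."""
    failures = {}
    for key, (acc, n) in subgroup_accuracy(records).items():
        if n < min_n:
            failures[key] = "insufficient data"
        elif acc < floor:
            failures[key] = f"accuracy {acc:.2f} below floor {floor:.2f}"
    return failures

# Hypothetical evaluation records, invented purely for illustration.
records = (
    [{"gender": "f", "region": "urban", "language": "en", "correct": True}] * 19
    + [{"gender": "f", "region": "urban", "language": "en", "correct": False}] * 1
    + [{"gender": "f", "region": "rural", "language": "sw", "correct": True}] * 6
    + [{"gender": "f", "region": "rural", "language": "sw", "correct": False}] * 4
)

overall = sum(r["correct"] for r in records) / len(records)
failures = check_floor(records, floor=0.80, min_n=5)
print(f"overall accuracy: {overall:.2f}")
print("subgroups failing the floor:", failures)
```

In this invented example the aggregate accuracy clears the floor while one intersectional subgroup does not, which is why the floor is checked per subgroup rather than on the average. Treating too few test cases as a failure, rather than a pass, reflects the brief's point that missing evidence is itself the problem. Running the same check on each successive model generation would yield the kind of longitudinal series that item 3 calls for.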

What is permanent, what is interim, what is cosmetic

These measures are not interchangeable, and the distinction matters for how the panel weighs them. The data commons is foundational: it is logically prior to every model, and it is the one asset that retains its value whichever way the remediate-versus-rebuild question resolves. Fairness floors and longitudinal measurement are permanent infrastructure — they would be required even in a world with a perfect foundation model, because models drift, deployment contexts shift, and every new training generation reopens the question. They are not bridges to a future fix; they are the standing safety apparatus of the field, as crash testing is to automobiles. By contrast, the measures most often presented today as solutions — debiasing fine-tunes, output filters, alignment patches applied after training — are interim at best and cosmetic at worst: they adjust what a model says while the substrate it learned from continues to decay underneath. A model can be patched to speak respectfully about women it fundamentally does not know. The panel should be careful not to let visible remediation activity be mistaken for systemic progress.

The ask

We ask the panel to recognize, in its scientific assessment, that (a) recursive training on synthetic data is now an established degradation mechanism whose first casualties are underrepresented populations; (b) the gender data gap therefore constitutes a time-critical infrastructure deficit, not a downstream fairness concern; and (c) the international research agenda should prioritize provenanced data commons, intersectional performance floors, longitudinal bias-decay measurement, and an empirical test of the remediate-versus-rebuild question — while the foundational layer is still wet.

The cost of building this infrastructure now is measured in millions. The cost of re-founding the model ecosystem after lock-in is measured in something closer to the cost of the ecosystem itself.

Contact: Women at the Table, Geneva | aiequalitytoolbox.com | womenatthetable.net

Sources

Recursive degradation. Shumailov I., et al., “AI models collapse when trained on recursively generated data,” Nature 631 (2024); Wyllie S., et al. (2024) and Dohmatob E., et al. (2024) on bias amplification across generations; Gerstgrasser M., et al. (2024) and Kazdan J., et al. (2024) on data accumulation as collapse prevention.

Clinical and judicial harm. Women at the Table & FemTechnology, Invisible by Design: Women’s Health as the Blind Spot in AI and Medicine (2025); Women at the Table, Gender Bias in Judicial Algorithms: A Global Analysis of Algorithmic Discrimination (CSW70 Expert Paper, 2026). Clinical-trial exclusion: NIH Revitalization Act of 1993 (P.L. 103-43); NIH Office of Research on Women’s Health. Economic valuation of unrecorded work: ILO, Care Work and Care Jobs (2018). Detailed sources in the companion Swiss brief.

Benefit-sharing precedent (for the data commons). Nagoya Protocol to the Convention on Biological Diversity (2014); CBD COP16 Decision 16/2 (Cali, 2024) establishing the multilateral DSI mechanism and Cali Fund, ≥50% to indigenous peoples and local communities; CARE Principles for Indigenous Data Governance (Carroll et al., 2020). Full architecture in the companion paper “Who collects the record.”

Image: Richard A Carter https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/

Polytype diagrams with geometric tile designs as conjoining ‘nodes’ with Devanagari numerals printed beneath each node. Some of the nodes on the right side of the image are in greyscale and the interconnecting edges of the nodes are joined by dotted lines. The left half of the image is in colour and shows blue and orange geometric tile designs; these nodes are joined by solid lines.
