Apertus: The AI Model That Can Prove Its Bias

The data gap AI can still close: A solvable scientific problem, a closing window, and why Switzerland is positioned to set the standard

12 June 2026
Women at the Table | A+ Alliance for Inclusive Algorithms | Geneva

Part 2 of three companion papers. Part 1 (“The Recursion Deadline”) makes the scientific case for why corrective data on women must enter the foundational layer now. Part 2 (this paper) applies it to the open Swiss model Apertus as a concrete, near-term opportunity. Part 3 (“Who collects the record”) sets out who builds the data and how.

A foundation model deployed in a clinic, a loan decision, or a benefits assessment is making a judgment about a woman in roughly half the cases it handles. Not an edge case. Half. And the data these models learn from describes women poorly, and describes some women barely at all.

This is not one fairness concern among many, and it is not the diffuse, unsolvable kind of bias that every model carries and no one can fully remove. It is a specific, measurable deficit in the core function of these systems, affecting half of every population they serve — and, as research now shows, it is getting worse on its own. Apertus is one of the very few frontier-scale models whose training data is open enough to measure this deficit directly, and the one Switzerland built. That makes it the natural place to correct the deficit by design rather than patch it at the surface. That is the opening.

Why this is not marginal

The clinical case is the sharpest. For decades, as a matter of regulatory policy, women were excluded from most clinical trials; US trials defaulted to male physiology until the NIH Revitalization Act of 1993 first wrote the inclusion of women into law. The medical literature that resulted — and the models trained on it — treats the male body as the default: dosing, adverse drug reactions, the very presentation of disease. This is not a gap at the periphery of medicine. It sits at the centre, and it affects all women before it affects any subgroup of women. The downstream effect on AI is now documented, not hypothetical: clinical AI trained on sex-misrepresentative data has been found to reduce diagnostic accuracy for women by 11.3 percentage points; cardiac-risk algorithms underperform for women even when trained on sex-balanced datasets; and state-of-the-art language models produce different clinical assessments from identical case notes depending only on whether the patient is labelled female or male (Women at the Table & FemTechnology, Invisible by Design, 2025). A clinical model trained on this record does not merely inherit the error — it automates it, at the speed and scale of software. From that already-displaced baseline, the distance compounds for women further from the documented centre: non-European, non-Anglophone, rural, older, disabled. But the foundation of the problem is universal. The male-default body is the body the model knows.

The economic case is structural and large. The ILO estimates that unpaid care work, roughly three-quarters of it performed by women, amounts to 16.4 billion hours a day and would be worth on the order of US$11 trillion, about 9% of global GDP, if paid at minimum wage; in parts of Latin America it exceeds 20% of GDP. Almost none of this activity, nor the vast informal trading and home-based production in which women are also concentrated, appears in the economic statistics that train financial and planning models. The systems now entering credit scoring, social protection, and economic planning are therefore structurally blind to a large fraction of women’s actual economic activity. A model cannot price, score, or allocate around what its training data never recorded. The predictable result is that automated economic decisions read women’s creditworthiness, eligibility, and productivity through data that systematically omits how women earn and what women do. The exclusion is not a rounding error; it is, by the ILO’s own valuation, one of the largest unmeasured sectors in the world economy.

And the systemic case is the one that should hold a policymaker’s attention. As foundation models become the layer through which people reach health information, financial services, and government, a model that cannot represent half a population is no longer a technical imperfection — it is an exclusion built into the infrastructure of public life. An economy that systematically misreads half its workers and a public sphere increasingly mediated through a demographically skewed layer are not the conditions of a stable, legitimate state. The stakes run from individual misdiagnosis to economic parity to, ultimately, who the digital state can see and serve.

A reasonable person will ask why gender, rather than any of the other groups the record under-documents. The answer is leverage. Women are not one under-represented category among many; they are half of every other category — half the rural poor, half the non-Anglophone, half the disabled, half of every population a model serves. Correcting the gender deficit is therefore the single highest-leverage intervention available, and the method built to do it — audit the corpus, map the gap, source against it — transfers directly to every other axis of exclusion. Gender is where the same fix returns the most.

Won’t downstream fixes handle it?

This objection deserves a direct answer, because a model builder will raise it first: don’t post-training methods — fine-tuning, reinforcement learning from human feedback, output filters — already correct for a skewed corpus? They can adjust what a model says. They do not repair what it has and has not learned. A model whose training data encoded the male-default body can be tuned to speak carefully about women while still reaching first for the male-default pattern when it estimates a dose, weighs a symptom, or scores a risk — because the correction sits on the surface and the deficit sits in the representation.

Whether, and how far, surface mitigation compensates for a skewed substrate is not a settled question — but the evidence available cuts in one direction. Output-level disparities are already documented: the same models produce different clinical assessments by sex label, despite whatever mitigation is applied (Invisible by Design, above). That tells us mitigation has not closed the gap. What is not yet measured is the composition of the training corpus that produces those disparities — the root beneath the documented symptom. That is what the audit proposed below establishes. Until it is run, the claim that downstream fixes are sufficient rests on no corpus-level evidence at all, while the disparities they are meant to fix are already on the record. And it is the unproven claim — that the fixes suffice — that is currently being used to license deployment in clinics and credit decisions. The burden of proof belongs with deployment, and on the substrate it has not been met.

Why it gets worse on its own

The intuition is that this corrects itself: more data accumulates, gaps fill, models improve. That intuition is wrong, and the mechanism is established in peer-reviewed work, not speculation. Shumailov et al. (Nature, 2024) and subsequent studies show that as models increasingly train on data generated by earlier models (now an unavoidable feature of a web filling with synthetic text), the sparse regions of the data distribution collapse first, and embedded biases amplify across successive generations. Women’s data is already the sparse region. So under recursive training the record on women thins rather than fills, and the deficit deepens.
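The finite-sampling part of this mechanism can be illustrated with a toy simulation. It is not a reproduction of Shumailov et al.: the "model" here is nothing more than an estimated share of text on a minority topic, refit each generation on a finite sample of the previous generation's output. The 5% starting share, the sample size, and all function names are illustrative assumptions.

```python
import random


def next_generation(minority_share: float, sample_size: int, rng: random.Random) -> float:
    """Refit the toy 'model' on a finite sample drawn from the previous generation.

    Assumption: the model is only an estimated share of minority-topic text.
    """
    hits = sum(1 for _ in range(sample_size) if rng.random() < minority_share)
    return hits / sample_size


def run_chain(start_share: float, sample_size: int, generations: int,
              rng: random.Random) -> float:
    """Return the minority share after repeated, purely recursive retraining."""
    share = start_share
    for _ in range(generations):
        share = next_generation(share, sample_size, rng)
        if share in (0.0, 1.0):  # absorbing: a vanished topic cannot be resampled
            break
    return share


def count_extinctions(trials: int = 200, start_share: float = 0.05,
                      sample_size: int = 50, generations: int = 300,
                      seed: int = 0) -> tuple[int, int]:
    """Count chains where the minority topic vanished vs. where the majority did."""
    rng = random.Random(seed)
    finals = [run_chain(start_share, sample_size, generations, rng)
              for _ in range(trials)]
    minority_lost = sum(1 for s in finals if s == 0.0)
    majority_lost = sum(1 for s in finals if s == 1.0)
    return minority_lost, majority_lost


if __name__ == "__main__":
    minority_lost, majority_lost = count_extinctions()
    print(f"minority topic lost in {minority_lost} of 200 chains")
    print(f"majority topic lost in {majority_lost} of 200 chains")
```

In this sketch each generation's sample is unbiased on average, yet the sparse topic is the one that usually disappears, because a share that reaches zero can never be resampled. The sketch models pure recursion only; mixing fresh human data into each generation would slow or prevent the absorption.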

Honesty about the strength of this claim matters, because it is where a rigorous reader will push. The collapse dynamics are demonstrated most cleanly in controlled settings of pure or near-pure recursion; real training pipelines still mix fresh human data with synthetic data, and serious labs actively filter synthetic content precisely to slow this effect. So the rate at which the window is closing is not precisely known, and this brief does not claim a date. What is not in dispute is the direction: recursion thins the tails rather than filling them, the gap on women compounds rather than corrects, and the cost of remedying it rises with each generation. This is what our companion scientific brief calls the half-life of a biased model: the representation of women decaying across training generations while output fluency hides the loss. A precautionary case does not need the exact closing date; it needs the fact that the door moves in one direction only. There is a window in which corrective data can still enter the foundational layer. After it narrows, the deficit stops being a property of any one model and becomes a property of the substrate every future model inherits.

Why this is a tractable scientific problem, not a moral burden

This is the reframing that matters for the people who build models. We are not asking the Apertus team to solve bias in general — an open-ended, possibly unsolvable task. We are pointing to one bounded problem that is measurable, has a defined target, and is uniquely solvable on this model.

It is measurable because Apertus is open. Most frontier models’ data deficit can only be inferred from their outputs, because their corpora are closed; Apertus’s corpus is published and documented, which makes a direct audit of the training data itself possible. And here is the fact that should focus the room: no one has ever measured what share of a frontier training corpus documents female physiology, or records women’s economic activity, because no one has had an open frontier corpus to measure. That number does not exist. Its absence is not a detail — it is the evidentiary hole at the centre of every assurance that these models are safe to deploy in women’s health or women’s finance. The method that fills it is standard NLP, not novel research: classifier pipelines label the corpus at scale (what share of medical text addresses female physiology and presentation rather than a male default; what share of economic text captures informal and unpaid work; how representation varies across the corpus’s 1,000+ documented languages), validated by stratified human annotation. It yields the field’s first composition baseline for a frontier corpus. Cost is a research team and modest compute for roughly a year — low single-digit millions, against a training run that consumed thousands of GPUs for months.
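One way the classifier-plus-annotation step could produce a composition figure is a stratified estimate: split the corpus by what the classifier said, have humans check a sample from each slice, and weight each slice's human-confirmed rate by its size. The sketch below assumes that design; the class, function names, and every number in the example are hypothetical and are not results from any audit.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """One slice of the corpus, e.g. documents the classifier flagged or did not flag."""
    name: str
    corpus_docs: int         # documents in this slice across the whole corpus
    annotated_docs: int      # documents humans checked in this slice
    annotated_positive: int  # of those, how many humans judged on-topic


def corrected_share(strata: list[Stratum]) -> float:
    """Stratified estimate of the corpus share that humans would label on-topic."""
    total_docs = sum(s.corpus_docs for s in strata)
    if total_docs == 0:
        raise ValueError("corpus is empty")
    estimated_positive = 0.0
    for s in strata:
        if s.annotated_docs == 0:
            raise ValueError(f"stratum {s.name!r} has no human annotations")
        human_rate = s.annotated_positive / s.annotated_docs
        estimated_positive += human_rate * s.corpus_docs
    return estimated_positive / total_docs


if __name__ == "__main__":
    # Invented numbers for illustration only.
    strata = [
        Stratum("classifier_flagged", corpus_docs=40_000,
                annotated_docs=200, annotated_positive=160),
        Stratum("classifier_unflagged", corpus_docs=960_000,
                annotated_docs=400, annotated_positive=8),
    ]
    raw = strata[0].corpus_docs / sum(s.corpus_docs for s in strata)
    print(f"raw classifier share:  {raw:.4f}")
    print(f"human-corrected share: {corrected_share(strata):.4f}")
```

The point of the design is that the published baseline rests on human judgment rather than on the classifier alone: the classifier's false positives and its misses both feed into the corrected figure.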

It has a defined target because the fix follows the method the Apertus team has already used. They looked at the linguistic composition of the web, refused to treat it as neutral, and re-weighted and sourced deliberately so that Swiss German and Romansh — a language of some 60,000 speakers — entered the corpus by design. That is precisely the operation required here: the audit produces a map of where the corpus is thin, and the next training run sources and weights corrective data against that map. Switzerland has already proven it can engineer a representational principle into a frontier model; the same method now needs to be applied to the larger case, and the next training cycle is the practical opportunity to do it.
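Sourcing and weighting against the map reduces to simple arithmetic once the audit has produced observed shares and a target has been chosen. The sketch below is a minimal illustration under those assumptions, not the Apertus team's procedure; the function names and the 2% and 5% figures are hypothetical.

```python
def upsampling_weights(observed: dict[str, float],
                       target: dict[str, float]) -> dict[str, float]:
    """Per-category sampling weight that moves the observed mix to the target mix."""
    if observed.keys() != target.keys():
        raise ValueError("observed and target must cover the same categories")
    weights = {}
    for category, share in observed.items():
        if share <= 0:
            # Weighting cannot create data that is absent; it has to be sourced.
            raise ValueError(f"{category!r} is absent from the corpus")
        weights[category] = target[category] / share
    return weights


def tokens_to_source(category_tokens: float, total_tokens: float,
                     target_share: float) -> float:
    """Extra tokens of one category needed to reach target_share of the enlarged corpus.

    Solves (category_tokens + x) / (total_tokens + x) = target_share for x.
    """
    if not 0 <= target_share < 1:
        raise ValueError("target_share must be in [0, 1)")
    needed = (target_share * total_tokens - category_tokens) / (1 - target_share)
    return max(0.0, needed)


if __name__ == "__main__":
    # Hypothetical audit result: the thin category holds 2% of tokens; target is 5%.
    weights = upsampling_weights({"gap": 0.02, "rest": 0.98},
                                 {"gap": 0.05, "rest": 0.95})
    print(weights)
    print(tokens_to_source(category_tokens=2.0, total_tokens=100.0, target_share=0.05))
```

The two functions correspond to the two levers in the text. Weighting repeats what already exists, so it cannot add variety; sourcing adds new material, which is why the second calculation matters wherever the map shows the corpus is thin.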

The standard: a demographic performance floor

The durable contribution here is not a single corrected model but a standard. Every safety-critical field requires a system to clear a defined threshold before deployment: a drug is no longer approved on male-only trial data; a bridge is not certified if it can bear only half its rated load. AI entering clinical, financial, and judicial use has no equivalent floor for demographic performance. We propose one: a minimum performance threshold evaluated across sex, language, and geography, below which a model is not deployed in high-stakes settings. Because Apertus can be measured and corrected, it is the natural place to define that floor and be the first to meet it, making a Swiss model the reference standard the rest of the field must answer to. This is the same floor that Women at the Table has proposed, in WSIS Action Line terms, as a measurable gender indicator for high-risk AI, here given the model on which it can first be demonstrated.
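A floor of this kind differs from an average: it is checked group by group, so one failing group blocks deployment even when the aggregate looks acceptable. The sketch below assumes a single score per (sex, language, region) group; the metric, the 0.85 threshold, and all scores are hypothetical placeholders, since the brief does not fix them.

```python
Group = tuple[str, str, str]  # (sex, language, region)


def groups_below_floor(scores: dict[Group, float], floor: float) -> list[Group]:
    """Return every group whose score falls below the floor, in sorted order."""
    return sorted(group for group, score in scores.items() if score < floor)


def clears_floor(scores: dict[Group, float], floor: float) -> bool:
    """A model clears the floor only if no evaluated group falls below it."""
    if not scores:
        raise ValueError("no groups were evaluated")
    return not groups_below_floor(scores, floor)


if __name__ == "__main__":
    # Invented scores for illustration only.
    scores = {
        ("female", "de", "CH"): 0.90,
        ("male", "de", "CH"): 0.92,
        ("female", "rm", "CH"): 0.81,
        ("male", "rm", "CH"): 0.88,
    }
    floor = 0.85
    mean = sum(scores.values()) / len(scores)
    print(f"mean score {mean:.4f} vs floor {floor}")
    print("clears floor:", clears_floor(scores, floor))
    print("failing groups:", groups_below_floor(scores, floor))
```

In the invented example the mean sits above the threshold while one group sits below it, which is the case an aggregate benchmark hides and a per-group floor exposes.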

Four proposals

1. Make demographic representation a measured design requirement for the next Apertus generation — the same status linguistic representation held in the first. The result would be the first model anywhere whose training deliberately corrected the documented data deficit on women rather than inheriting it.

2. Fund and publish the corpus audit. Establish the composition baseline the field lacks, on the team’s own terms, with the remedy already underway. Convert Swiss openness into measurement infrastructure that carries Switzerland’s name.

3. Anchor a provenanced data commons on women. Consented, community-owned, collected with women’s organizations as the field force, beginning where the deficit is most lethal: health, economic participation, low-resource languages. The collection architecture is developed and phased, its first stage costed at roughly CHF 2.5–3.5 million and built on certification infrastructure already operating across five continents. The financing model is not speculative: under the Convention on Biological Diversity, COP16’s Cali Fund (2024–25) established that commercial users of digitized, community-derived resources contribute to a fund reserving at least half its proceeds for the communities who steward them — a principle now agreed by 196 governments. A data commons on women applies that principle to a new resource class, and Geneva is its natural custodian.

4. Host the observatory in Geneva. Standing, independent measurement of how demographic representation moves across model generations is performed today by no institution on earth. Its findings have value only if no measured party controls the observatory; one housed inside a lab or a national-champion program produces numbers no one else will trust. Geneva is where the world already keeps measurements rival states treat as fact (WHO health statistics, ITU standards, WMO-coordinated climate data), within reach of the OHCHR human-rights framework and a train ride from the compute and expertise at CSCS, EPFL, and ETH. It passes the three tests no other city passes together: independence, accepted authority, technical capacity.

The cost of waiting

Today, correction means adding the right data to the next training run — a research-scale cost. After two or three more generations, it will not. The thinned record will be baked into the corpora, the checkpoints, and the benchmarks every new model starts from; correction will then mean rebuilding corpora and retraining from scratch, at a cost only the largest commercial labs can bear — and those labs will have the least commercial reason to spend it on the populations being erased. The capacity to fix it and the motive to fix it will, by then, sit in different institutions.

Switzerland built Apertus to demonstrate that a different kind of foundation was possible: open, accountable, deliberately inclusive. The demonstration is incomplete while the foundation still leaves out half of humanity. Completing it is, for now, a solvable scientific problem with a known method and a credible price. It will not stay that way.

Contact: Women at the Table, Geneva | womenatthetable.net | aiequalitytoolbox.com

Sources

Clinical trial exclusion. US trials defaulted to male physiology under FDA guidance from 1977; the NIH Revitalization Act of 1993 (P.L. 103-43) was the first statutory mandate for the inclusion of women in NIH-funded clinical research; FDA, “Guideline for the Study and Evaluation of Gender Differences in the Clinical Evaluation of Drugs” (1993). Overview: NIH Office of Research on Women’s Health.

Sex-stratified AI performance. Women at the Table & FemTechnology, Invisible by Design: Women’s Health as the Blind Spot in AI and Medicine (2025) — traces the six-layer cascade from male-default clinical research to biased AI outputs, and documents that clinical AI trained on sex-misrepresentative data reduces diagnostic accuracy for women by 11.3 percentage points, that cardiac-risk algorithms underperform for women even on sex-balanced datasets, and that large language models produce divergent clinical assessments from identical case notes by patient sex label. Women at the Table, Gender Bias in Judicial Algorithms: A Global Analysis of Algorithmic Discrimination (CSW70 Expert Paper, 2026) — the parallel structural pattern in criminal-justice AI, including systematic overprediction of women’s recidivism. Primary sources are cited within both reports.

Economic value of unrecorded work. ILO, Care Work and Care Jobs for the Future of Decent Work (2018): unpaid care work ≈16.4 billion hours/day, ~76% performed by women, valued at ~US$11 trillion or ~9% of global GDP at minimum wage; UNDP and OECD corroborating estimates (regional figures exceeding 20% of GDP in Latin America). ILO (2024): an estimated 708 million women are outside the labour force owing to unpaid care responsibilities.

Recursive degradation / model collapse. Shumailov I., et al., “AI models collapse when trained on recursively generated data,” Nature 631 (2024); Wyllie S., et al. (2024) on bias amplification across generations; Gerstgrasser M., et al. (2024) on data accumulation as collapse prevention.

Benefit-sharing precedent. Nagoya Protocol to the Convention on Biological Diversity (2014); CBD COP16 Decision 16/2 (Cali, 2024) establishing the multilateral DSI mechanism and Cali Fund, with ≥50% of proceeds allocated to indigenous peoples and local communities; Fund launched February 2025 (UNDP/UNEP).

Image: Richard A Carter / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/

Image description: Polytope diagrams with multi-coloured, bright geometric tile designs as conjoining ‘nodes’, with Devanagari numerals printed beneath each node. Within the larger polytope diagram, there is a smaller structure with interconnected nodes. The tiles are set out and connected by dotted-line edges so as to bear out different combinations of colours and designs in a systematic way.
