The Model That Can Prove It.
What Switzerland Built — and the Gap It Hasn’t Closed Yet.
Last week I went to Bern for the Swiss IGF. In the room were some of the people who built Apertus — Switzerland’s open-source large language model, released last September by EPFL, ETH Zurich, and the national supercomputing centre. I want to tell you what they got right, because it is genuinely rare, and then I want to tell you about the one thing they, and every other model on earth, have not yet fixed. Because Apertus is the only model in the world where that thing can actually be fixed, and the window to do it is the next time they hit “train.”
Let’s start with what they got right, because this builds the whole argument.
When you build a foundation model, you feed it an enormous amount of text scraped from the internet, and the model learns the world from that text. The catch is that the internet is not the world. It over-represents some languages, some places, some people, and barely contains others. Most model builders treat that distribution as a fact of nature — you train on what’s there.
The Apertus team refused to. They looked at the linguistic composition of the web and decided it was not neutral but correctable. They deliberately sourced and weighted the training data so that 40% of it was non-English, across more than a thousand languages. Romansh, a Swiss language with roughly 60,000 speakers, is in the model because they decided it should be. They built inclusion as a design requirement, not a patch applied afterward, and they published everything: the weights, the data, the recipe. You can read how it was made.
Hold onto that, because it is the precedent for everything that follows.
Switzerland did for language exactly what no one has done for women.
In the first piece in this series, I argued that the case for representation in AI is not, at root, a gender argument — it is a data quality argument. A system trained on systematically unrepresentative data is a scientifically deficient system, and that framing survives technical scrutiny in rooms where the gender framing gets waved through and forgotten. Apertus is the sharpest possible instance of that argument, because the Apertus team has already accepted it — for language. They just haven’t yet applied the same method to the other great gap in the training record.
That gap is women.
Here is the finding I want you to carry out of this piece. Researchers have shown that you can take a state-of-the-art language model, give it an identical set of clinical case notes, and get a different medical assessment depending only on whether the patient is labelled female or male. Same symptoms. Same words. Different answer, because the patient is a woman.
That is not a glitch. It is the model faithfully reproducing a medical record that was built on men. For decades, as a matter of policy, women were excluded from clinical trials; United States research defaulted to male physiology until 1993. The textbooks, the dosing guidelines, the description of what a heart attack looks like — calibrated on male bodies. When AI learns medicine from that record, it does not correct the omission. It scales it. Clinical AI trained on sex-misrepresentative data has been found to reduce diagnostic accuracy for women by 11.3 percentage points. Cardiac risk algorithms underperform for women even when the dataset is balanced, because the underlying clinical knowledge was not. (These findings are documented in Invisible by Design, the report my organization published with FemTechnology last year.)
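The paired test described earlier (identical notes, only the sex label changed) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method of any study cited here: `toy_model` is a deliberately biased stand-in invented for the example, and `swap_sex_label` and `flip_rate` are hypothetical helper names. A real audit would pass an actual clinical model in place of the toy one.

```python
import re
from typing import Callable, Sequence

# Matches the standalone labels "female" or "male" (not "male" inside "female").
_LABEL = re.compile(r"\b(female|male)\b", flags=re.IGNORECASE)


def swap_sex_label(note: str) -> str:
    """Return the note with every 'female'/'male' label flipped, nothing else changed."""
    def flip(match: re.Match) -> str:
        word = match.group(0)
        swapped = "male" if word.lower() == "female" else "female"
        return swapped.capitalize() if word[0].isupper() else swapped
    return _LABEL.sub(flip, note)


def flip_rate(model: Callable[[str], str], notes: Sequence[str]) -> float:
    """Share of notes whose assessment changes when only the sex label flips."""
    changed = sum(model(n) != model(swap_sex_label(n)) for n in notes)
    return changed / len(notes)


def toy_model(note: str) -> str:
    """ASSUMPTION: an invented, deliberately biased rule standing in for a real model."""
    text = note.lower()
    is_male = re.search(r"\bmale\b", text) is not None
    if "chest pain" in text:
        return "cardiac workup" if is_male else "reassure and discharge"
    return "routine follow-up"


# Invented example notes, not real patient data.
notes = [
    "Patient: female, 54. Chest pain radiating to jaw, nausea.",
    "Patient: male, 61. Sprained ankle after a fall.",
]
print(flip_rate(toy_model, notes))  # prints 0.5: one of the two assessments flips
```

A flip rate of zero is what a sex-neutral reading of identical symptoms would produce. Anything above zero means the label alone is moving the answer.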
This is not a problem at the edge of medicine, affecting some small group. It sits at the centre, and it affects all women before it affects any particular woman — and then it compounds, fastest and hardest, for women already furthest from the data: rural, non-English-speaking, poor, in the Global South, where these systems are increasingly deployed with the least oversight.
And it is not only health.
The same blankness runs through the economic record. The unpaid care work that women perform — the cooking, the child-rearing, the elder care — is worth, by the International Labour Organization’s estimate, around 11 trillion dollars a year, close to 9% of global GDP. Almost none of it appears in the economic statistics that AI systems learn from. Neither does most of women’s informal and home-based earning. So the models now entering credit scoring, welfare allocation, and economic planning are structurally blind to a vast share of what women actually do. A model cannot price, score, or plan around what its training data never recorded.
A model used in a clinic, a loan decision, or a benefits office is making a judgment about a woman in roughly half the cases it handles. Half. This is not a marginal defect. It is a defect in the core function of the system, affecting half of every population it serves.
Why it gets worse if we wait.
Here is the part that turns this from a problem into a deadline. The intuition is that data gaps fill in over time: more gets digitized, the record improves. The opposite is now happening, and it is documented in the peer-reviewed literature, not speculated.
Increasingly, AI models are trained on text generated by earlier AI models, because the web is filling up with synthetic content. When that happens, researchers have shown that the rare and thinly represented parts of the data disappear first, while the model stays fluent enough that no one notices the loss. Women’s data is already the thin part. So each new generation of models doesn’t narrow the gap. It deepens it, quietly, while sounding more confident than ever. I wrote about this mechanism in more detail in a companion brief on what we’ve called the half-life of a biased model.
I want to be precise, because this is where a careful reader pushes back: nobody can yet tell you the exact rate of that decay, and serious labs filter synthetic data specifically to slow it. We don’t claim a date. What is not in dispute is the direction. The door moves one way. The cost of fixing this rises with every training generation, and the window in which corrective data can still enter the foundation of these systems is open now and will not stay open.
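The direction of that decay can be seen in a toy simulation. The sketch below is a simplified resampling model written for this piece, not a reproduction of the published experiments: each "generation" draws a finite sample from the previous generation's topic mix and refits the mix from that sample alone. The function names, the two-topic corpus, and the starting shares are all invented for illustration. It says nothing about the rate of decay in real systems, which also filter and mix in human-written data.

```python
import random
from collections import Counter


def next_generation(shares: dict, n_samples: int, rng: random.Random) -> dict:
    """Sample n_samples items from `shares`, then refit the shares from that sample."""
    topics = list(shares)
    draws = rng.choices(topics, weights=[shares[t] for t in topics], k=n_samples)
    counts = Counter(draws)
    # A topic that drew zero samples gets share 0.0 and can never be drawn again.
    return {t: counts[t] / n_samples for t in topics}


def simulate(shares: dict, generations: int, n_samples: int, seed: int = 0) -> list:
    """Return the list of topic shares for generation 0, 1, ..., `generations`."""
    rng = random.Random(seed)
    history = [dict(shares)]
    for _ in range(generations):
        shares = next_generation(shares, n_samples, rng)
        history.append(shares)
    return history


# ASSUMPTION: made-up starting mix in which the "rare" topic is 2% of the data.
history = simulate({"common": 0.98, "rare": 0.02}, generations=300, n_samples=200)
for g in (0, 50, 100, 300):
    print(g, history[g]["rare"])
```

The point of the sketch is the one-way door: in this model a share of zero is permanent, so a thin topic that is lost in one generation cannot return in a later one unless new data is deliberately added.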
Why Apertus, specifically.
Every frontier model carries this gap. Almost none of them can be examined, because their training data is secret — you can only guess at what’s inside from what comes out. Apertus is different. Its corpus is open and documented. That makes it, as far as I know, the only frontier-scale model in the world where you can actually open the hood and measure the gap directly: what share of the medical text describes female physiology, what share of the economic text captures women’s work, how representation breaks down across all those languages.
That number has never been measured for any frontier model, because no one has ever had an open one to measure. It does not exist. And its absence is the quiet hole underneath every claim that these systems are ready to be trusted with women’s health or women’s money.
I am not saying Apertus is uniquely biased. Every model is. I am saying Apertus is uniquely fixable — and that the people who built it have already proven they know how, because they did it for Romansh. The method is the same: measure where the data is thin, then deliberately source and weight to fill it, on the next training run. The audit to produce that measurement would cost a rounding error against what the model cost to train. The fix follows a path the team has already walked.
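The method named above (measure where the data is thin, then source and weight to fill it) can be sketched as follows. This is a minimal illustration under loud assumptions, not the Apertus team's pipeline: it assumes each document has already been tagged with a group label by some annotation step, and the group names, counts, and target shares are invented. `measure_shares` and `sampling_weights` are hypothetical helper names.

```python
from collections import Counter
from typing import Dict, Iterable


def measure_shares(tags: Iterable[str]) -> Dict[str, float]:
    """Audit step: the share of documents carrying each group tag."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def sampling_weights(observed: Dict[str, float], target: Dict[str, float]) -> Dict[str, float]:
    """Per-document sampling weight for each group so weighted shares hit `target`."""
    weights = {}
    for group, goal in target.items():
        have = observed.get(group, 0.0)
        if have == 0.0:
            # Reweighting cannot create data that is absent; it has to be sourced.
            raise ValueError(f"no data for {group!r}: new data must be sourced")
        weights[group] = goal / have
    return weights


# ASSUMPTION: invented tags for 100 medical documents, not a measurement of any corpus.
tags = ["male_physiology"] * 85 + ["female_physiology"] * 15
observed = measure_shares(tags)
target = {"male_physiology": 0.5, "female_physiology": 0.5}
weights = sampling_weights(observed, target)
print(observed)
print(weights)
```

The error branch matters as much as the arithmetic. Upweighting only stretches what already exists, so where the audit finds nothing at all, the fix is sourcing new data rather than reweighting.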
This is the conversation I’m bringing to Bern. Not an accusation. An invitation: you built the one model where this can be solved, and you’ve already shown you know how — so be the ones to solve it, and set the standard the rest of the field has to answer to.
Where this connects to the larger fight.
In this series I keep returning to one idea: the goal is not to defend representation in AI against the people trying to strip it out. It is to build the architecture so well — so grounded in scientific standards and economic evidence — that leaving women out becomes technically and politically incoherent.
Apertus is where that gets concrete. And it connects directly to the governance work I’ll write about next: we have proposed, as a measurable indicator for the WSIS process, that high-risk AI systems should have to demonstrate they were trained on representative data and validated across sexes before they are deployed in health or justice. That indicator needs a place to be demonstrated first. Apertus is that place. The model and the measurement are two halves of one thing.
The medieval builders in the image above didn’t argue about whether the edifice should stand. They argued about the foundation, while it could still be poured. That is exactly where we are. The foundation of the AI era is being laid right now, and for once we can see into it — because Switzerland built a model honest enough to be read.
The question I’m carrying to Bern is whether the people who poured it will choose to pour the other half. The next training run is when they’ll decide, whether they mean to or not.
There is no neutral version of that decision.
Caitlin Kraft-Buchman is Executive Director of Women at the Table / AI & Equality Human Rights Initiative and Co-Chair of the CSTD Gender Advisory Board. The findings on sex-stratified AI performance are drawn from Invisible by Design: Women’s Health as the Blind Spot in AI and Medicine (Women at the Table & FemTechnology, 2025). The mechanism of bias compounding across model generations is set out in the companion brief “The Recursion Deadline.”
Image Credit: Richard A Carter / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/. Polytype diagrams with multi-coloured, bright geometric tile designs as conjoining ‘nodes’ with Devanagari numerals printed beneath each node. The interconnecting edges of the diagram are dotted lines; in some cases, the nodes are connected using solid lines. The tiles are set out and connected to bear out different combinations of colours and designs in a systematic way, joined by dotted and solid lines.