Who Collects the Data: Building AI's Missing Record on Women

Who collects the record: An architecture for building the missing data on women

14 June 2026
Women at the Table | A+ Alliance for Inclusive Algorithms | Geneva

Part 3 of three companion papers. Part 1 (“The Recursion Deadline”) makes the scientific case for why corrective data on women must enter the foundational layer now. Part 2 (“The data gap AI can still close“) applies it to the open Swiss model Apertus as a concrete, near-term opportunity. Part 3 (this paper) sets out who builds the data and how.

The premise

The data does not exist. That is the finding of the first two papers in this series: women’s health beyond the male clinical default, women’s informal and unpaid economic activity, women’s voices in low-resource languages — these are not hiding somewhere on the web waiting for better scraping. They were never written down. So the data must be created, and the question of who creates it is not logistics. It decides whether the project is repair or a second extraction. Data about underrepresented women, collected by institutions those women have no relationship with and no power over, reproduces the original problem with better metadata.

This paper proposes the field force, the pipeline, the ownership model, and the honest risks.

The proposition: women’s rights organizations as the collection workforce

There are tens of thousands of women’s rights organizations operating at community level worldwide — running health programs, supporting informal workers, documenting violence, organizing in languages no research consortium speaks. We propose training, certifying, equipping and paying these organizations as the data collection workforce for the global commons.

The case rests on three capabilities no alternative field force possesses. Trust: these organizations hold standing relationships with exactly the populations in the statistical tail — relationships built over decades, which no university IRB process or commercial data vendor can purchase. A woman will describe her symptoms, her work, her income to the organization that ran her literacy class; she will not describe them to an enumerator with a tablet and a per-interview quota. Consent practice: rights organizations navigate informed consent, confidentiality and safeguarding as core competencies, often in contexts where a signature on a form means little and real consent is relational and ongoing. Reach: they are already physically present in the last mile — rural districts, informal settlements, conflict-adjacent regions — where the data gap is widest and where parachute collection fails or causes harm.

The proposition also inverts the political economy of AI data. In the current pipeline, communities are the mined resource and value accrues elsewhere. Under this architecture, women’s organizations and the communities they serve become producers and owners of the scarcest asset in the AI ecosystem — verified, provenanced, consented human data about populations no lab can otherwise reach. Licensing that asset (see governance, below) creates a durable revenue stream for a chronically underfunded global movement. Data collection becomes not a distraction from the rights mission but a financing mechanism for it.

This is not untested. Masakhane has built African-language NLP datasets through community researchers. Karya in India pays rural workers — majority women — above-market rates for speech and annotation data, demonstrating that dignified, well-paid data work at community level produces high-quality corpora. Mozilla Common Voice showed volunteer voice collection can scale across languages. What has never been done is assembling these proofs of concept into standing infrastructure with rights organizations as the institutional backbone — and with ownership retained where the data originates.

What gets collected first

Priority is set by three filters applied together: where the stakes are highest, where the documented gap is widest, and where collection is feasible without endangering contributors. Three domains pass all filters.

Health narratives. Symptom descriptions, treatment experiences, maternal and menopausal health, in women’s own words and languages — the corpus that would let a clinical model learn what female presentation actually sounds like, including where it diverges from the male-default textbook. This is the data layer beneath the sex-stratified performance requirements documented in Invisible by Design (Women at the Table & FemTechnology, 2025): a model cannot be validated as performing across sexes if the data describing female presentation was never collected. Gathered through the health programs rights organizations already run.

Time-use and informal work. Structured diaries and oral accounts of unpaid care, informal trading, home-based production — the economic activity national statistics ignore. This corpus makes women’s work legible to the credit-scoring, social-protection and economic-planning models currently trained on its absence.

Voice and language. Speech corpora of women’s voices in low-resource languages. Voice interfaces — increasingly the access layer for health, finance and government services among low-literacy populations — fail worst for exactly these speakers. Read speech, conversational speech, and the local-language health and economic vocabulary the other two domains generate.

The pipeline: five layers

Standards before collection. A common protocol covering consent (ongoing and revocable, not one-time), anonymization and de-identification, provenance certification (cryptographically documented chain from contributor to commons — the property that makes this data valuable in a synthetic-flooded ecosystem), annotation schemas, and data dictionaries per domain. Built once, openly published, adapted per region with partner organizations — not imposed.
Training and certification. A structured curriculum converting the standards into practice: consent in low-literacy contexts, collection tools, quality procedures, safeguarding, data security. Certification sits with the AI & Equality initiative — the institutional home of the Toolbox — extending an existing, validated pedagogy (delivered across five continents, with a 300-researcher community and university partnerships from Makerere to CENIA) from teaching rights-based AI analysis to certifying rights-based data collection. Certification is what lets a funder, a sovereign model program, or an Apertus training run trust the corpus without knowing every collector personally — and it keeps the standards authority with the rights-based institution rather than with any data custodian or implementation partner.
Collection tooling. Mobile-first, offline-capable, voice-first for low-literacy contexts; consent capture built into the instrument, not appended; local-language interfaces. Built open-source so no vendor owns the pipeline’s throat.
Validation and annotation. A second tier of paid community work: native-speaker validation, annotation against the published schemas, inter-annotator agreement procedures. Quality control designed as employment, inside the same communities — following the Karya evidence that paying well produces better data than paying badly.
Custody and access. Certified corpora flow to the commons described in the companion papers — federated where data is sensitive (health data may never need to leave a regional custodian; models can come to the data), with the Geneva anchor providing standards, certification authority, and the trusted neutral registry.

Ownership and the licensing engine

Ownership stays at origin. The governing instruments are data trusts or cooperatives in which contributing communities and organizations hold the rights, operating under the CARE principles (Collective benefit, Authority to control, Responsibility, Ethics) developed by the Indigenous data sovereignty movement — the most rigorous existing framework for exactly this problem. International law supplies the benefit-sharing precedent in two layers: the Nagoya Protocol (2014) established that access to community-held genetic resources is conditioned on prior informed consent and benefit-sharing, and COP16’s Cali Fund (2024–25) extended that principle to digitized derivatives of those resources with a fixed community allocation. Together they give this architecture a treaty-grounded model for the otherwise novel claim that commercial use of community data carries an enforceable benefit-sharing obligation.

Access is tiered, and this is where the architecture grows teeth. Public-interest and sovereign models that meet defined conditions — including the intersectional performance floors proposed in our panel brief — receive access at low or no cost. Commercial frontier labs pay commercial rates, with revenue flowing back through the trusts to the collecting organizations and contributing communities. The labs’ own scarcity problem makes this viable: verified human data about populations absent from the web is the one input they cannot synthesize, scrape, or buy elsewhere.

This is not a hypothetical legal structure. In 2024–25 the world’s governments built exactly this mechanism for an adjacent resource. Under the Convention on Biological Diversity, COP16 established the Cali Fund: commercial users of digital sequence information — genetic data, the biological analogue of our case — are expected to pay into a global fund, with at least 50% earmarked for indigenous peoples and local communities who steward the resource. It treats a digitized resource derived from community-held assets as something whose commercial use triggers benefit-sharing obligations, and it routes the proceeds to the stewards. That is the precise logic of this licensing engine, now ratified by 196 governments and operational under UNDP/UNEP. The objection “you cannot make labs pay communities for data” is already answered: the international community decided the principle, for genetic data, in 2024. We are proposing its extension to the data on women.

What can go wrong

Honesty about failure modes is part of the architecture. Extraction by good intentions: even a rights-framed pipeline can drift toward treating communities as suppliers; the counterweights are ownership-at-origin, revocable consent, and governance seats for contributors, not just collectors. Weaponization: data about women — health status, income, location — is dangerous in the wrong hands, and in some jurisdictions the wrong hands are the government’s; the counterweights are federated custody, aggressive de-identification standards, and a published threat model per region, with the option not to collect where protection cannot be guaranteed. Mission distortion: rights organizations must not become data factories; collection should be designed as a bounded, paid program line, never a condition of core funding. Sampling bias: organizations reach their own networks, which are not random; the counterweight is documented sampling frames and honest coverage statements — a corpus that states its limits is scientific; one that hides them is marketing. Synthetic contamination: the commons’ entire value is human provenance, so contributor-side verification must be designed in from the first pilot, not retrofitted.

Phasing

Phase 1 (18 months): standards published; curriculum and certification built on AI & Equality infrastructure; pilots with three to five organizations across three regions and all three domains; first certified corpora; governance instruments incorporated. Ballpark cost: CHF 2.5–3.5 million, comprising standards and legal development (≈0.5M), curriculum and certification build (≈0.3M), open-source collection tooling adapted from existing platforms (≈0.4M), pilot grants to partner organizations including collector wages and contributor payments (≈0.8–1.2M), validation and annotation (≈0.25M), governance incorporation and regional threat modeling (≈0.35M), and program management with independent evaluation (≈0.4M). Phase 2: certification opened to a first cohort of organizations through existing university and Toolbox partnership networks; first licensing agreements; first corpus delivered to a sovereign model training run as proof of the full loop. Phase 3: standing operation — the collection infrastructure runs as permanent institution, the observatory measures whether the corpora are moving model performance, and the licensing revenue progressively replaces grant dependence.

The cost of Phase 1 is the cost of a mid-sized research project. The asset it creates is the input every model ecosystem on earth will eventually need and cannot make for itself.

References

Nagoya Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from their Utilization to the Convention on Biological Diversity (adopted 2010, entered into force 2014) — Art. 5 (benefit-sharing), Art. 6–7 (prior informed consent, including of indigenous and local communities).

Convention on Biological Diversity, COP16 Decision 16/2 (Cali, Nov 2024) establishing the multilateral mechanism and the Cali Fund for benefit-sharing from Digital Sequence Information (DSI) on genetic resources; Fund launched Rome, Feb 2025 (UNDP/UNEP/MPTFO). Note the ≥50% allocation to indigenous peoples and local communities — the closest existing precedent for the licensing engine proposed here, extending benefit-sharing from physical resources to their digitized form.

Carroll, S.R., Garba, I., Figueroa-Rodríguez, O.L., et al. (2020). “The CARE Principles for Indigenous Data Governance.” Data Science Journal, 19(1), 43. Global Indigenous Data Alliance.

Nekoto, W., et al. (Masakhane) (2020). “Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages.” Findings of EMNLP 2020.

Ardila, R., et al. (2020). “Common Voice: A Massively-Multilingual Speech Corpus.” Proceedings of LREC 2020. Mozilla Foundation.

Karya (karya.in) — social enterprise model for dignified, above-market-wage data work in rural India; majority-women workforce.

Shumailov, I., et al. (2024). “AI models collapse when trained on recursively generated data.” Nature, 631, 755–759. Gerstgrasser, M., et al. (2024) and Kazdan, J., et al. (2024) on data accumulation as collapse prevention. Wyllie, S., et al. (2024) on bias amplification under recursive generation. (Full recursion literature cited in the companion panel brief.)

Women at the Table & FemTechnology (2025). Invisible by Design: Women’s Health as the Blind Spot in AI and Medicine. Women at the Table (2026). Gender Bias in Judicial Algorithms: A Global Analysis of Algorithmic Discrimination (CSW70 Expert Paper). — the harm evidence motivating the health and high-stakes collection priorities above.

Contact: Women at the Table, Geneva | womenatthetable.net | aiequalitytoolbox.com

Image: Richard A Carter / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/

Polytype diagrams with geometric tile designs as coinjoining ‘nodes’ with Devanagari numerals printed beneath each node. Some of the nodes are in grey scale and the interconnecting edges of the nodes are joined by dotted lines. A connected cuboid of nodes in the top right is in colour and shows blue geometric tile designs; these nodes are joined by solid lines.

Comments are closed.

Who Collects the Data: Building AI’s Missing Record on Women