HumRights-Bench
AI systems now decide who gets hired, who receives benefits, whose speech is removed, and which tools enter public services. Each of those is a decision about human rights. Yet until now there was no way to test whether the models making them can reason about human rights at all. HumRights-Bench is that test: the first benchmark grounded in international human rights law.
The gap
Mainstream AI evaluation asks what a model should do, measured against aggregated human preference or general ethical principles.
Human rights law asks something different and harder: what a model must recognise, because the governments, companies, and institutions deploying it are legally obligated to uphold it. No existing benchmark measured that. Safety classifiers, alignment training, and model cards touch human rights only incidentally, through vague values language rather than the actual obligation structure of international law. That is the gap HumRights-Bench was built to close.
What we built
Developed over the past year by AI & Equality by Women at the Table with researchers from Hunter College, the Oxford Internet Institute, Georgetown University, and the University of Oslo, HumRights-Bench is expert-validated and scenario-based. It was presented as an accepted poster at CS&Law 2026, and the methodology is now under submission to ICML’s AI for Law (AI4Law) track.
We adapted IRAC, the framework used to train lawyers, into IRAP: Issue Identification, Rule Recall, Rule Application, and Proposed Remedies. Substituting remedies for a binary conclusion reflects how human rights practice actually works, since practitioners do not return guilt-or-innocence verdicts but identify which obligation is engaged and what response fits the duty-bearer and the people affected.
Scenarios are realistic situations drawn from UN General Comments, Special Procedures reports, and leading jurisprudence, then validated by human rights lawyers and practitioners around the world.
The pilot covers the right to water.
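To make the four IRAP stages concrete, here is a minimal sketch of how answers might be scored stage by stage. It is an illustration only, not the project's actual data format or scoring method: the `Stage` enum, the `Item` record, the `accuracy_by_stage` function, and the scenario IDs are hypothetical names, and it assumes each answer is simply marked right or wrong against an expert key.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """The four IRAP stages named in the text."""
    ISSUE = "Issue Identification"
    RULE_RECALL = "Rule Recall"
    RULE_APPLICATION = "Rule Application"
    REMEDIES = "Proposed Remedies"


@dataclass(frozen=True)
class Item:
    """One scored answer (hypothetical schema, for illustration only)."""
    scenario_id: str
    stage: Stage
    correct: bool  # assumption: answers are marked right/wrong against an expert key


def accuracy_by_stage(items):
    """Return the fraction of correct answers for each stage that appears."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for item in items:
        totals[item.stage] += 1
        hits[item.stage] += int(item.correct)
    return {stage: hits[stage] / totals[stage] for stage in totals}


# Made-up records, not pilot results.
items = [
    Item("water-01", Stage.ISSUE, False),
    Item("water-01", Stage.RULE_RECALL, True),
    Item("water-01", Stage.RULE_APPLICATION, True),
    Item("water-01", Stage.REMEDIES, False),
    Item("water-02", Stage.ISSUE, True),
    Item("water-02", Stage.RULE_RECALL, True),
]
scores = accuracy_by_stage(items)
for stage, accuracy in scores.items():
    print(f"{stage.value}: {accuracy:.0%}")
```

Reporting a score per stage rather than a single total shows at which step of the reasoning a model goes wrong.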
What the pilot found
We tested leading frontier models, including GPT-5, Claude, and Gemini, alongside an open-source reference model. Every system performed near chance: roughly 34 to 58 percent overall.
Most telling was where they failed. Models were weakest at issue identification, recognising when a right has been violated and which obligation is engaged. That is the foundational first step, and a failure there cascades into the wrong rules and misconfigured remedies down the line. The pilot is small and the results are exploratory, but the signal is clear and the timing is urgent: the models already being deployed in rights-critical decisions cannot yet reliably perform the reasoning those decisions require. HumRights-Bench makes that failure legible to the developers, regulators, and institutions responsible for it.
Why a law-grounded benchmark matters now
For the first time, the institutional landscape is built to use this kind of evidence. The Council of Europe's Framework Convention on Artificial Intelligence designates HUDERIA as its recommended methodology for human rights risk and impact assessment across the AI lifecycle, yet HUDERIA has no empirical basis for checking whether the models being assessed can reason about the rights at stake.
HumRights-Bench supplies exactly that foundation. It can equally inform the Fundamental Rights Impact Assessments required under Article 27 of the EU AI Act, giving regulators structured, documented, reproducible evidence in place of vendor assurances.
We are bringing the results into the rooms where it counts. On 18 June 2026 at the Palais des Nations, we present HumRights-Bench to Member States in “Human Rights AI Benchmark: Can AI Understand Human Rights Law?”, convened with Globethics.
What’s next: a consortium for the world’s rights
Two directions.
The first is breadth: expanding from the right to water to the right to due process, the right to education, and further, and into multiple languages, since model reasoning varies sharply across them.
The second is the ambition that makes HumRights-Bench shared global infrastructure rather than a single research project. We are building a consortium in which specialised UN agencies own the scenarios for the rights they steward: the World Food Programme on the right to food, the International Labour Organization on the right to work, the World Health Organization on the right to health, and so on, alongside universities across different regions so the benchmark reflects how rights are realised in practice, not only how they read on paper. The institutions that hold the mandates author the test.
That is the point of HumRights-Bench: to turn human rights from a value we hope AI will honour into a standard we can hold it to.