Women at the Table

The Challenge

AI systems increasingly make decisions that directly impact human lives—determining who gets hired, who receives social services, whose content gets amplified or suppressed, and who accesses essential resources. Yet no rigorous benchmark exists to evaluate whether these systems understand fundamental human rights principles like non-discrimination, due process, or equitable access to resources.

The gap is critical: without measurable standards grounded in international human rights law, we risk automating rights violations at unprecedented scale. Existing AI benchmarks measure performance on cognitive tasks or narrow ethics topics, but none evaluate how well AI systems recognize and respect the human rights frameworks that should govern their deployment in high-stakes contexts.

Our Approach

Led by Dr. Savannah Thais, Machine Learning and Society Lead at Women at the Table, HumRights-Bench is the first expert-validated framework to evaluate large language models (LLMs) against human rights principles.

The benchmark methodology combines technical rigor with human rights expertise:

  • Real-World Scenarios: We construct complex, realistic scenarios across diverse social contexts—from water access in informal settlements to due process in legal systems—capturing how rights violations manifest in practice. Each scenario is modular, allowing us to measure how AI systems respond to different locations, marginalized groups, and contextual factors (see the first sketch after this list).
  • Legal Reasoning Framework: Adapted from the IRAC legal methodology and validated in LegalBench, our IRAP approach tests whether AI systems can identify rights violations, recall relevant legal frameworks, apply appropriate provisions, and propose effective remedies. Question formats range from multiple-choice and ranking tasks to open-ended remedy proposals.
  • Expert Validation: Every scenario and assessment is validated by at least three human rights professionals, ensuring alignment with international human rights law and real-world practice. This isn’t AI researchers’ interpretation of human rights—it’s measurement grounded in the frameworks and expertise of practitioners.
  • Taxonomic Rigor: We systematically map the human rights problem space by typologies of violations, perpetrators, affected stakeholders, social contexts, and complex conditions like armed conflict or indigenous rights—ensuring comprehensive coverage rather than cherry-picked examples (see the second sketch after this list).
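
To make the modular scenario structure and the IRAP item format concrete, here is a minimal Python sketch of how such items could be represented. Every name in it (Scenario, IRAPItem, the example fields and values) is a hypothetical illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical schema for illustration only; not HumRights-Bench's actual data model.

@dataclass
class Scenario:
    """A modular scenario: a narrative template plus swappable contextual factors."""
    template: str          # narrative text with {placeholders}
    locations: list[str]   # interchangeable settings
    groups: list[str]      # affected or marginalized groups
    contexts: list[str]    # contextual factors (e.g., drought, privatization)

    def instances(self):
        """Yield every concrete variant, so model behavior can be compared
        across locations, groups, and contexts while the core facts stay fixed."""
        for loc, grp, ctx in product(self.locations, self.groups, self.contexts):
            yield self.template.format(location=loc, group=grp, context=ctx)

@dataclass
class IRAPItem:
    """One evaluation item walking through the four IRAP steps."""
    scenario_text: str
    identify: dict   # e.g., multiple-choice: which right is at issue?
    recall: dict     # e.g., multiple-choice: which legal framework applies?
    apply: dict      # e.g., ranking task: order provisions by relevance
    propose: str     # open-ended prompt asking for an effective remedy

water = Scenario(
    template=("Residents of an informal settlement in {location}, many of whom are "
              "{group}, lose access to the public water supply during {context}."),
    locations=["City A", "City B"],
    groups=["internally displaced people", "an indigenous community"],
    contexts=["a drought", "a privatization dispute"],
)

# One illustrative item built from the first scenario variant.
item = IRAPItem(
    scenario_text=next(water.instances()),
    identify={"question": "Which right is most directly at issue?",
              "choices": ["right to water", "right to education", "freedom of expression"]},
    recall={"question": "Which instrument is most relevant?",
            "choices": ["ICESCR", "CEDAW", "CRC"]},
    apply={"rank": ["CESCR General Comment No. 15", "ICESCR Articles 11-12"]},
    propose="Propose an effective remedy for the affected residents.",
)
```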
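
The validation and coverage requirements suggest equally simple admission checks, sketched below. The three-validator threshold comes from the description above; the taxonomy axes, their values, and all other names are illustrative assumptions.

```python
from dataclasses import dataclass

# Taxonomy axes mirroring the mapping described above; the specific values
# listed here are illustrative assumptions, not the benchmark's actual taxonomy.
TAXONOMY = {
    "violation": {"denial_of_access", "discrimination", "procedural_unfairness"},
    "perpetrator": {"state", "private_actor", "non_state_armed_group"},
    "stakeholder": {"children", "indigenous_peoples", "displaced_persons"},
    "context": {"armed_conflict", "informal_settlement", "legal_proceeding"},
}

MIN_VALIDATORS = 3  # every scenario is signed off by at least three experts

@dataclass
class ValidationRecord:
    item_id: str
    labels: dict[str, str]   # one label per taxonomy axis
    approvals: list[str]     # IDs of approving human rights professionals

    def is_admissible(self) -> bool:
        """An item enters the benchmark only if its labels are valid taxonomy
        values and enough distinct experts have approved it."""
        labels_ok = all(
            self.labels.get(axis) in values for axis, values in TAXONOMY.items()
        )
        return labels_ok and len(set(self.approvals)) >= MIN_VALIDATORS

def coverage_gaps(records):
    """Report taxonomy values with no admissible item, flagging under-covered
    regions of the problem space instead of relying on cherry-picked examples."""
    seen = {axis: set() for axis in TAXONOMY}
    for rec in records:
        if rec.is_admissible():
            for axis, label in rec.labels.items():
                seen[axis].add(label)
    return {axis: TAXONOMY[axis] - seen[axis] for axis in TAXONOMY}
```

A coverage report like this surfaces under-covered cells of the taxonomy before release rather than after an external audit finds the gap.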

The Impact

HumRights-Bench creates accountability infrastructure for AI systems before they scale:

  • For human rights organizations: Evidence-based guidance on which AI models understand rights principles well enough to use safely in their workflows—and which don’t.
  • For AI developers: Measurable standards to evaluate and improve models’ human rights performance before deployment, integrated into the development cycle rather than discovered after harm occurs.
  • For policymakers and regulators: Objective benchmarks to inform AI governance frameworks, procurement standards, and accountability requirements grounded in international human rights law.

Starting with the right to water and due process, HumRights-Bench builds a scalable foundation to assess emerging models against all rights enumerated in the Universal Declaration of Human Rights. Published openly for the global research community, validated by leading human rights experts, and designed for practical application, the benchmark transforms abstract human rights commitments into concrete technical standards.

This isn’t incremental improvement in AI ethics—it’s establishing the measurement infrastructure that should have existed before AI systems began making decisions affecting billions of lives.
