Human Rights Benchmark for LLMs: Research Outcomes
We are advancing the Human Rights Benchmark for Large Language Models (LLMs)—a research initiative that examines how these systems align with core human rights principles.
AI models are making high-stakes decisions that directly impact human rights, but currently, no standard benchmark exists to evaluate their compliance.
In this Open Studio, Dr. Savannah Thais presents the outcomes of this work, sharing findings from the benchmarking research and what they reveal about the human rights implications of LLMs. The Human Rights Benchmark Project is a first-of-its-kind, expert-annotated dataset designed to test Large Language Models (LLMs) like GPT, Claude, and Gemini on their understanding of international human rights law.
Until now, no industry-standard benchmark has existed to rigorously evaluate an LLM’s capacity to understand and respect human rights law. Existing AI benchmarks focus narrowly on technical performance, cognitive tasks (like reasoning), or limited ethical topics (like social bias). They don’t come close to measuring an LLM’s comprehension of international human rights obligations. This is the gap our Human Rights Benchmark Project is designed to fill.
The Gold Standard: Designing a Valid and Measurable Benchmark
A good benchmark must serve as the gold standard for evaluation. Our design process, developed through close collaboration between AI researchers and human rights experts, adheres to four critical principles to ensure construct validity:
- Real-World Fidelity: The tasks must reflect the actual work human rights professionals perform, such as monitoring and reporting.
- Measurability: The tasks must allow for objective scoring, moving beyond difficult-to-evaluate open-ended text generation where possible.
- Domain Coverage (Taxonomy): The benchmark must capture the full “axes of variation” within the human rights domain. Our taxonomy divides scenarios into analytic categories (like obligation to respect, protect, or fulfill) and descriptive categories (like actors involved and rights holders affected), including complex scenarios involving AI or conflict (see the annotation sketch after this list).
- Appropriate Metrics: Developing scoring methods that accurately assess an LLM’s response, particularly for complex legal reasoning.
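To make the taxonomy concrete, here is a minimal sketch of how a single benchmark item could be annotated. The class and field names (`BenchmarkItem`, `actors`, `rights_holders`, `tags`) are illustrative assumptions, not the project’s actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class Obligation(Enum):
    """Analytic category: which state obligation a scenario implicates."""
    RESPECT = "respect"
    PROTECT = "protect"
    FULFILL = "fulfill"

@dataclass
class BenchmarkItem:
    """Hypothetical annotation schema for one benchmark sub-scenario."""
    scenario: str                   # narrative text grounded in a realistic situation
    right: str                      # e.g. "right to water"
    obligation: Obligation          # analytic category (respect / protect / fulfill)
    actors: list[str] = field(default_factory=list)          # descriptive: actors involved
    rights_holders: list[str] = field(default_factory=list)  # descriptive: affected rights holders
    tags: list[str] = field(default_factory=list)            # e.g. ["AI", "armed conflict"]
```

Keeping the analytic category as an enum also keeps the answer key for obligation-related multiple-choice questions unambiguous.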
From Scenario to Score: The IRAC Methodology
To build a challenging and realistic dataset, we use a three-step prompt generation process. It starts with realistic scenarios informed by human rights textbooks and legal publications. These overall scenarios are then broken down into sub-scenarios that narrow the focus.
Finally, we apply a modified legal reasoning framework called IRAC (Issue, Rule Recall, Rule Application, Proposed Remedies) to test the LLMs’ knowledge across four prompt types:
- Issue Identification: Multiple-choice questions testing the type of obligation violated (e.g., protect, fulfill).
- Rule Recall: Identifying which specific international human rights law applies to the situation.
- Rule Application: Ranking the applicability of various relevant laws.
- Proposed Remedies: An open-ended task asking the model to suggest up to ten remedial actions for the state.
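To illustrate how these prompt types might be represented and rendered, the sketch below enumerates them and formats one hypothetical Issue Identification question. The scenario text and helper function are invented for illustration, not drawn from the benchmark itself.

```python
from enum import Enum

class PromptType(Enum):
    ISSUE_IDENTIFICATION = "issue_identification"  # multiple choice: which obligation is at issue
    RULE_RECALL = "rule_recall"                    # which international instrument/provision applies
    RULE_APPLICATION = "rule_application"          # rank the applicability of candidate provisions
    PROPOSED_REMEDIES = "proposed_remedies"        # open-ended: up to ten remedial actions

def format_issue_identification(sub_scenario: str, options: list[str]) -> str:
    """Render a multiple-choice Issue Identification prompt for a sub-scenario."""
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"Scenario:\n{sub_scenario}\n\n"
        "Which type of state obligation is most directly at issue?\n"
        f"{lettered}\nAnswer with a single letter."
    )

# Hypothetical example; the scenario is invented, not taken from the benchmark.
print(format_issue_identification(
    "A municipality cuts off water service to an informal settlement without notice.",
    ["Obligation to respect", "Obligation to protect", "Obligation to fulfill"],
))
```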
Critically, every scenario, question, and answer is validated by at least three human rights experts—a crucial step to ensure the benchmark measures real human rights concepts, not just statistical artifacts.
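One simple way to operationalize that multi-expert check is a consensus rule like the sketch below. The function is an illustrative assumption rather than the project’s actual review workflow, though the three-expert threshold mirrors the validation described above.

```python
from collections import Counter

def consensus_answer(expert_answers: list[str], min_agree: int = 3) -> str | None:
    """Return the answer at least `min_agree` experts converge on, else None (item goes back for revision)."""
    answer, count = Counter(expert_answers).most_common(1)[0]
    return answer if count >= min_agree else None

print(consensus_answer(["protect", "protect", "protect", "fulfill"]))  # "protect" -> answer key accepted
print(consensus_answer(["protect", "respect", "fulfill"]))             # None -> no consensus, needs review
```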
Preliminary Results: Models Struggle with the Core of Human Rights Law
Our initial findings, focusing on the Right to Water, reveal a sobering truth. When tested on the straightforward multiple-choice questions (Issue Identification and Rule Recall), leading models like GPT-4.1, Gemini 2.5 Flash, and Claude Sonnet 4 all clustered around 50–60% accuracy.
This result cuts both ways: it confirms the benchmark is nuanced enough to differentiate performance (a score of 100% would mean the test is too easy), but it also demonstrates that these powerful models have a surprisingly weak internalized representation of human rights law.
Most strikingly, all models performed worst on the fundamental task of identifying the nature of the state’s violated obligation (respect, protect, or fulfill). This suggests a key deficit in understanding the core legal structure of human rights. Furthermore, the models exhibited stochastic performance (variability across repeated runs), which suggests they are reasoning rather than relying on memorized answers from the internet.
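For readers who want to run a similar analysis on their own model outputs, the sketch below shows how multiple-choice accuracy and run-to-run spread could be computed. The answer key and runs are dummy data, not results from the benchmark.

```python
import statistics

def accuracy(preds: list[str], gold: list[str]) -> float:
    """Fraction of multiple-choice answers that match the expert-validated key."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

# Dummy answer key and three repeated runs of the same model on the same items,
# illustrating how run-to-run variability would be summarized (not our actual data).
gold = ["B", "A", "C", "B", "A"]
runs = [
    ["B", "A", "A", "B", "C"],
    ["B", "C", "C", "B", "A"],
    ["A", "A", "C", "B", "A"],
]
scores = [accuracy(r, gold) for r in runs]
print(f"mean accuracy = {statistics.mean(scores):.2f}, stdev = {statistics.stdev(scores):.2f}")
```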
Next Steps: Scaling and Solving the Scoring Challenge
The project has successfully validated its methodology for the Right to Water and is now expanding to cover the Right to Due Process. Our biggest ongoing challenge remains scoring the open-ended Proposed Remedies questions accurately and at scale. We need automated metrics that capture the nuance of expert human judgment, so the benchmark can be widely adopted without every developer needing to hire a team of human rights lawyers.
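One possible direction, offered here only as a sketch and not as the project’s settled method, is to compare model-proposed remedies against expert-written reference remedies using embedding similarity; the encoder choice, threshold, and coverage metric below are illustrative assumptions.

```python
# Match each expert reference remedy to the closest model-proposed remedy
# by cosine similarity between sentence embeddings.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def remedy_coverage(proposed: list[str], reference: list[str], threshold: float = 0.6) -> float:
    """Fraction of expert reference remedies matched by some proposed remedy
    above a cosine-similarity threshold (0.6 is an arbitrary illustrative value)."""
    prop_emb = encoder.encode(proposed, convert_to_tensor=True)
    ref_emb = encoder.encode(reference, convert_to_tensor=True)
    sims = util.cos_sim(ref_emb, prop_emb)  # shape: [n_reference, n_proposed]
    return float((sims.max(dim=1).values >= threshold).float().mean())
```

A metric like this would still need calibration against expert scores before it could stand in for human judgment.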
The Human Rights Benchmark Project is the first step toward holding LLMs accountable for their societal impact. By encouraging its adoption, we aim to establish a standard that pushes the AI research community to develop models that are not only intelligent but also fundamentally ethical and rights-respecting.