Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Distributive Fairness in Large Language Models: Evaluating Alignment with Human Values
Authors: Hadi Hosseini, Samarth Khanna
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we examine whether LLM responses adhere to fundamental fairness concepts such as equitability, envy-freeness, and Rawlsian maximin, and investigate their alignment with human preferences. We evaluate the performance of several LLMs, providing a comparative benchmark of their ability to reflect these measures. Our results demonstrate a lack of alignment between current LLM responses and human distributional preferences. |
| Researcher Affiliation | Academia | Hadi Hosseini Penn State University, USA EMAIL Samarth Khanna Penn State University, USA EMAIL |
| Pseudocode | No | The paper describes the methodology and evaluation process in prose, without structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data and Code: github.com/SamarthKhanna/Distributive-Fairness-LLMs |
| Open Datasets | Yes | Dataset. We adopt instances from the dataset that was developed by Herreiner and Puppe [50]. |
| Dataset Splits | No | The paper evaluates pre-trained LLMs on specific resource allocation problem instances (I1-I10 and additional instances). It does not describe conventional dataset splits (e.g., training, validation, test sets) for training or fine-tuning the LLMs' behavior, as the LLMs are used as off-the-shelf models. The instances themselves serve as the evaluation set. |
| Hardware Specification | No | We use APIs to collect responses from each of the LLMs. The details of the API provided and exact model names are provided in Table 14. Since the experiments involve the use of APIs, and inference/computation is not performed locally, the details of the machine on which the experiment has been done have not been provided. |
| Software Dependencies | No | Table 14: Details of models used (model, API provider, model string): GPT-4o, OpenAI, gpt-4o-2024-05-13; Claude-3.5S, Anthropic, claude-3-5-sonnet-20240620; Gemini-1.5P, Google, gemini-1.5-pro; Llama3-70b, Groq, llama3-70b-8192. The paper specifies the LLMs and their API versions used for the experiments, but does not list specific programming languages, libraries, or frameworks with version numbers for the authors' own implementation or data analysis. |
| Experiment Setup | Yes | Each model is used with the default temperature of 1.0 to enable a wider range of responses. To generate a representative set of responses, each model was queried 100 times on each instance. We implement an approach we call two-stage prompting strategy to eliminate sensitivity to templates. We refer the reader to Appendix J and Appendix G for details on prompt design and prompt sensitivity analysis. |
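
The fairness concepts named in the Research Type row (envy-freeness, equitability, and Rawlsian maximin) can be illustrated with a short check on a single allocation. The following is a minimal sketch, assuming additive valuations over indivisible goods; the function names, agent names, and numbers are hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List

# Hypothetical data shapes for illustration only.
Valuations = Dict[str, Dict[str, int]]  # agent -> good -> value
Allocation = Dict[str, List[str]]       # agent -> goods received


def bundle_value(valuations: Valuations, agent: str, bundle: List[str]) -> int:
    """Value that `agent` assigns to `bundle` (assumes additive valuations)."""
    return sum(valuations[agent][good] for good in bundle)


def is_envy_free(valuations: Valuations, allocation: Allocation) -> bool:
    """True if no agent values another agent's bundle above their own."""
    return all(
        bundle_value(valuations, a, allocation[a])
        >= bundle_value(valuations, a, allocation[b])
        for a in allocation
        for b in allocation
    )


def is_equitable(valuations: Valuations, allocation: Allocation) -> bool:
    """True if every agent derives the same value from their own bundle."""
    own_values = {bundle_value(valuations, a, allocation[a]) for a in allocation}
    return len(own_values) == 1


def min_utility(valuations: Valuations, allocation: Allocation) -> int:
    """Value of the worst-off agent; a Rawlsian maximin allocation maximizes this."""
    return min(bundle_value(valuations, a, allocation[a]) for a in allocation)


# Made-up two-agent, three-good instance.
valuations = {
    "alice": {"x": 6, "y": 3, "z": 1},
    "bob": {"x": 2, "y": 5, "z": 3},
}
split_a = {"alice": ["x"], "bob": ["y", "z"]}
split_b = {"alice": ["x", "y"], "bob": ["z"]}

print(is_envy_free(valuations, split_a), is_equitable(valuations, split_a))
print(min_utility(valuations, split_a), min_utility(valuations, split_b))
```

Under this sketch, comparing `min_utility` across candidate allocations identifies the one a maximin criterion would prefer, while `is_envy_free` and `is_equitable` test the other two properties for a given allocation.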