Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge
Authors: Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas J Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating Gen AI systems. Specifically, our position is that evaluating Gen AI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of Gen AI systems. |
| Researcher Affiliation | Collaboration | Hanna Wallach 1 Meera Desai 2 A. Feder Cooper 1 Angelina Wang 3 Chad Atalla 1 Solon Barocas 1 Su Lin Blodgett 1 Alexandra Chouldechova 1 Emily Corvi 1 P. Alex Dow 1 Jean Garcia-Gathright 1 Alexandra Olteanu 1 Nicholas Pangakis 1 Stefanie Reed 1 Emily Sheng 1 Dan Vann 1 Jennifer Wortman Vaughan 1 Matthew Vogel 1 Hannah Washington 1 Abigail Z. Jacobs 2 1Microsoft Research 2University of Michigan 3Stanford University. Correspondence to: Hanna Wallach <EMAIL>. |
| Pseudocode | No | The paper presents a conceptual framework and discusses processes (systematization, operationalization, application, interrogation) in descriptive text and a diagram (Figure 1), but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper is a position paper proposing a framework for evaluating Gen AI systems. It does not present new methodology that would require source code release. |
| Open Datasets | No | The paper is theoretical and does not conduct experiments using datasets. It refers to existing benchmarks and datasets in hypothetical examples (e.g., "International Math Olympiad problems" as a measurement instrument for a hypothetical task), but the authors do not use them for experimental work in this paper. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments; therefore, it does not provide dataset split information for reproduction. |
| Hardware Specification | No | The paper is theoretical and does not conduct experiments, thus no hardware specifications are provided. |
| Software Dependencies | No | The paper is theoretical and does not conduct experiments; therefore, it does not list software dependencies with version numbers for reproducing experimental results. |
| Experiment Setup | No | The paper presents a theoretical framework and does not describe any experiments. As such, it does not provide specific experimental setup details or hyperparameters. |