Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

More of the Same: Persistent Representational Harms Under Increased Representation

Authors: Jennifer Mickel, Maria De-Arteaga, Liu Leqi, Kevin Tian

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this work, we develop GAS(P), an evaluation methodology for surfacing distribution-level group representational biases in generated text, tackling the setting where groups are unprompted (i.e., groups are not specified in the input to generative systems). We apply this novel methodology to investigate gendered representations in occupations across state-of-the-art large language models. Our evaluation methodology reveals that there are statistically significant distribution-level differences in the word choice used to describe biographies and personas of different genders across occupations, and we show that many of these differences are associated with representational harms and stereotypes. Our empirical findings caution that naively increasing (unprompted) representation may inadvertently proliferate representational biases, and our proposed evaluation methodology enables systematic and rigorous measurement of the problem.
Researcher Affiliation	Collaboration	Jennifer Mickel (Eleuther AI); Maria De-Arteaga (Universitat Ramon Llull, ESADE); Liu Leqi (UT Austin); Kevin Tian (UT Austin)
Pseudocode	Yes	We provide the pseudocode for the Gender Association Method in Algorithm 1 and detailed discussion of the method's accuracy and validation set in Appendix A.1. Algorithm 1 Gender Association Method. Algorithm 4 Calibrated Marked Words method.
Open Source Code	Yes	The dataset of generated personas and biographies, as well as the code to reproduce our results and use the methods and metrics we propose, is located at https://github.com/jennm/more-of-the-same.
Open Datasets	Yes	We ran prompts with and without gender on 63 occupations, 60 of which are occupations in Winogender [61], and we add software engineer, cook, and pilot. ... We then compare the percentage of women in each occupation to the Bureau of Labor and Statistics (BLS) from 2024 [51]. ... For the English prior, we used the Brown corpus from NLTK [40].
Dataset Splits	No	The paper does not provide traditional training/test/validation dataset splits. Instead, it describes criteria for generating and collecting data for analysis: 'To ensure statistical significance of our findings, we generate personas until we have at least 100 personas per occupation, associated gender, and prompt. We require that at least 10% of instances be associated with each gender for an occupation to be considered due to computational limitations.' This refers to data collection for analysis, not data splitting for model training.
Hardware Specification	No	Our experiments were run using APIs provided by Open AI (for GPT-3.5 and GPT-4o-mini) and Together API (for Llama-3.1-70b). No specific hardware details (e.g., GPU models, CPU types) are provided for the underlying systems used by these APIs.
Software Dependencies	No	The paper mentions several software components like NLTK [40], Word2Vec [45], gensim [56], sklearn for K-means++ clustering, and M3-Embeddings [16]. However, it does not provide specific version numbers for these ancillary software dependencies, which are required for a reproducible description.
Experiment Setup	Yes	We run prompts with and without gender on 63 occupations... The prompts used are "Generate a persona of a [OCCUPATION]" and "Describe a [OCCUPATION] as if you are writing a biography"... To ensure statistical significance of our findings, we generate personas until we have at least 100 personas per occupation, associated gender, and prompt. ... We developed the Calibrated Marked Words method, inspired by the Marked Personas method introduced by Cheng et al. [17]. We build on this method by 1) rather than using the generated text as our prior, we use a hybrid prior consisting of both the English language and the generated text; and 2) adding a calibration step through hyperparameter tuning described in Appendix A.2 and shown in Algorithm 3. ... To identify the optimal mixing parameter α, we tested various values of α on a subset of the data in increments of 0.05 from 0 to 1, and we found α = 0.15 yielded the best results. ... To determine the optimal number of clusters to use, we use the Silhouette Score proposed by Rousseeuw [60]... We plotted the Silhouette Statistic as shown in Figure 8 and determined that 1500 clusters is the optimal number.
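
Illustrative sketch (not the authors' code): the setup quoted above mentions a sweep over the mixing parameter α from 0 to 1 in increments of 0.05, a hybrid prior built from English text and generated text, and a 10% per-gender threshold for an occupation to be considered. The Python below shows one plausible reading of those three pieces. The function names, the convex-combination form of the prior, and the choice that α weights the English side are all assumptions made for illustration; the paper's Appendix A.2 and the linked repository are the authoritative description.

```python
from collections import Counter


def alpha_grid(step=0.05):
    """Candidate mixing weights from 0 to 1 inclusive, in increments of `step`."""
    n_steps = round(1 / step)
    return [round(i * step, 2) for i in range(n_steps + 1)]


def normalize(counts):
    """Turn raw word counts into a probability distribution (counts must be non-empty)."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def hybrid_prior(english_counts, generated_counts, alpha):
    """Mix an English-language prior with a generated-text prior.

    ASSUMPTION: a convex combination in which `alpha` weights the English
    side. The source only says the prior is a hybrid of the two.
    """
    english = normalize(english_counts)
    generated = normalize(generated_counts)
    vocab = set(english) | set(generated)
    return {
        word: alpha * english.get(word, 0.0) + (1 - alpha) * generated.get(word, 0.0)
        for word in vocab
    }


def occupation_is_eligible(gender_counts, min_share=0.10):
    """Check that every gender accounts for at least `min_share` of an occupation's instances."""
    total = sum(gender_counts.values())
    if total == 0:
        return False
    return all(count / total >= min_share for count in gender_counts.values())


if __name__ == "__main__":
    # Toy counts, invented purely to exercise the functions.
    english = Counter({"the": 3, "a": 1})
    generated = Counter({"the": 1, "kind": 1})
    prior = hybrid_prior(english, generated, alpha=0.15)
    print(len(alpha_grid()), sorted(prior))
    print(occupation_is_eligible({"woman": 95, "man": 5}))
```

In a real calibration loop, each value from `alpha_grid()` would be scored on a held-out subset and the best one kept; the scoring criterion is not stated in the quoted text, so it is left out here.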