Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models

Authors: Keyon Vafa, Sarah Bentley, Jon Kleinberg, Sendhil Mullainathan

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement this benchmark in user studies of text-to-image and large language models. Despite the ability of these models to produce high-quality outputs, they all perform poorly on steerability.
Researcher Affiliation | Academia | Keyon Vafa (Harvard University); Sarah Bentley (MIT); Jon Kleinberg (Cornell University); Sendhil Mullainathan (MIT)
Pseudocode | No | The paper describes steps for reinforcement learning in Appendix C.2 using a numbered list, but these are descriptive sentences rather than structured pseudocode or algorithm blocks with specific syntax for variables, loops, or conditional statements. For example, it lists: "1. Sample a mean from each bucket's posterior for each round. 2. Choose the bucket with highest sampled mean. 3. Play an episode using the corresponding mixture scales. 4. Update statistics for chosen (round, bucket) pairs using the episode reward."
Open Source Code | Yes | We release all of the data we collect. https://github.com/SarahBentley/Steerability
Open Datasets | Yes | For each model, we sample a goal image from the model's producible set by prompting it with a random image caption from the PixelProse dataset [73]. We then show the goal image to a human user and instruct them to generate an image as close as possible to the goal image.
Dataset Splits | Yes | We split the data collected in Section 3 into 80/20 train/test splits.
Hardware Specification | Yes | The training and analyses were performed on a single A100 GPU. ... Instead, we used a server of 8 H100 GPUs to perform image generation and perturbation; we found the process to be efficient.
Software Dependencies | No | The paper mentions using specific models like DALL-E and Stable Diffusion, and metrics like CLIP and DreamSim, but does not provide specific version numbers for any software libraries, programming languages, or other ancillary software components used in the experiments.
Experiment Setup | Yes | We study 10 text-to-image models. We consider four variants of the Stable Diffusion models: SD3-large, SD3.5-medium, SD3.5-large, and SD3.5-large-turbo [23]. We also consider two versions of DALL-E: DALL-E 2 [67] and DALL-E 3 [5]. We also consider Flux-dev [6], Flux-1.1-pro-ultra [6], Ideogram-v2-turbo [37], and Photon-flash [57]. We use publicly available APIs for each model: Stability AI for the Stable Diffusion models, the OpenAI API for the DALL-E models, and the Replicate AI API for all other models. ... We give them 5 attempts to prompt the model. ... We trained for 60,000 episodes using Stable Diffusion 1.4 as the base model. Each episode used a different prompt and reference image sampled from our dataset. Episodes were run with 4 rounds of interaction and 2 image variations per round.
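
The four steps quoted in the Pseudocode row read like Thompson sampling run separately for each round. The sketch below is one possible reading, not the authors' implementation. Everything the summary does not state is an assumption: the Gaussian posterior with an N(0, 1) prior and unit-variance reward noise, the class name RoundBucketThompsonSampler, the choice to credit the same episode reward to every chosen (round, bucket) pair, and the toy play_episode function that stands in for running the image model with the chosen mixture scales.

```python
import math
import random


class RoundBucketThompsonSampler:
    """Gaussian Thompson sampling with one independent bandit per round.

    Assumed, not taken from the paper: N(0, 1) prior on each bucket's mean
    reward and unit-variance reward noise.
    """

    def __init__(self, n_rounds, n_buckets, seed=0):
        self.counts = [[0] * n_buckets for _ in range(n_rounds)]
        self.reward_sums = [[0.0] * n_buckets for _ in range(n_rounds)]
        self.rng = random.Random(seed)

    def choose_buckets(self):
        chosen = []
        for counts, sums in zip(self.counts, self.reward_sums):
            # Step 1: sample a mean from each bucket's posterior for this round.
            samples = [
                self.rng.gauss(s / (n + 1), 1.0 / math.sqrt(n + 1))
                for n, s in zip(counts, sums)
            ]
            # Step 2: choose the bucket with the highest sampled mean.
            chosen.append(max(range(len(samples)), key=samples.__getitem__))
        return chosen

    def update(self, chosen, episode_reward):
        # Step 4: update statistics for the chosen (round, bucket) pairs.
        for round_idx, bucket in enumerate(chosen):
            self.counts[round_idx][bucket] += 1
            self.reward_sums[round_idx][bucket] += episode_reward


def play_episode(chosen, rng):
    """Hypothetical stand-in for step 3: bucket 2 is best in every round."""
    hit_rate = sum(bucket == 2 for bucket in chosen) / len(chosen)
    return hit_rate + rng.gauss(0.0, 0.1)


def run(n_episodes=500, n_rounds=4, n_buckets=3, seed=0):
    sampler = RoundBucketThompsonSampler(n_rounds, n_buckets, seed)
    env_rng = random.Random(seed + 1)
    for _ in range(n_episodes):
        chosen = sampler.choose_buckets()
        reward = play_episode(chosen, env_rng)  # Step 3: play one episode.
        sampler.update(chosen, reward)
    return sampler
```

Calling run() returns the sampler, whose counts table shows how often each bucket was picked in each round. The default of 4 rounds mirrors the episode length reported in the Experiment Setup row; the number of buckets and the reward function are placeholders.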