Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Position: Societal Impacts Research Requires Benchmarks for Creative Composition Tasks

Authors: Judy Hanwen Shen

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through a thematic analysis using 2 million language model user prompts, we identify creative composition tasks as a prevalent usage category where users seek help with personal tasks that require everyday creativity. Our fine-grained analysis identifies mismatches between current benchmarks and usage patterns among these tasks.
Researcher Affiliation	Academia	Department of Computer Science, Stanford University, Joint Work with Carlos Guestrin. Correspondence to: Judy Hanwen Shen <EMAIL>.
Pseudocode	No	The paper describes a thematic analysis methodology qualitatively, outlining steps like filtering, clustering, and coding, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code	No	The paper does not provide concrete access to source code for its methodology. It only provides links to the datasets used and generated.
Open Datasets	Yes	We examine two open-access datasets, WildChat-1M (Zhao et al., 2024) and LMSYS-Chat-1M (Zheng et al., 2023)... We make the prompts and clusters available here: LMSYS: https://huggingface.co/datasets/heyyjudes/lmsys-creative-labeled WildChat: https://huggingface.co/datasets/heyyjudes/wildchat-creative-onlylabeled
Dataset Splits	No	The paper describes a thematic analysis of user prompts, including filtering and clustering, but it does not specify training/test/validation dataset splits typically used for reproducing machine learning experiments.
Hardware Specification	No	The paper describes a qualitative thematic analysis and position, and therefore does not include details on hardware specifications used for running experiments.
Software Dependencies	No	The paper mentions using 'OpenAI moderation flags' and 'Claude 3.5 Sonnet' as tools for data processing and summarization, but it does not list specific version numbers for any software libraries or frameworks required to reproduce its analysis methodology.
Experiment Setup	No	The paper describes a qualitative thematic analysis and position, and therefore does not provide specific experimental setup details such as hyperparameters or training configurations typically found in machine learning experiments.
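
The Open Datasets row above links two Hugging Face datasets released with the paper. The following is a minimal sketch of how a reader might fetch them, not the authors' own tooling. The helper names (`DATASET_URLS`, `repo_id_from_url`, `load_labeled_prompts`) are hypothetical and introduced here for illustration; the sketch assumes the Hugging Face `datasets` package is installed and that the repositories are publicly readable. Column names and splits are not stated in the table, so the sketch does not assume any.

```python
from urllib.parse import urlparse

# Dataset URLs copied from the "Open Datasets" row of the table.
DATASET_URLS = {
    "lmsys": "https://huggingface.co/datasets/heyyjudes/lmsys-creative-labeled",
    "wildchat": "https://huggingface.co/datasets/heyyjudes/wildchat-creative-onlylabeled",
}


def repo_id_from_url(url: str) -> str:
    """Turn a Hugging Face dataset URL into its 'owner/name' repository ID."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: /datasets/<owner>/<name>
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_labeled_prompts(name: str):
    """Download one of the released datasets.

    Requires network access and the `datasets` package (an assumption;
    the table does not say how the authors accessed the data).
    """
    # Imported lazily so the URL helper above works without the package.
    from datasets import load_dataset

    return load_dataset(repo_id_from_url(DATASET_URLS[name]))
```

Calling `load_labeled_prompts("lmsys")` and printing the result shows whichever splits and columns the repository actually provides, which is the safest way to discover the schema before analysis.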