Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Position: Societal Impacts Research Requires Benchmarks for Creative Composition Tasks

Authors: Judy Hanwen Shen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a thematic analysis using 2 million language model user prompts, we identify creative composition tasks as a prevalent usage category where users seek help with personal tasks that require everyday creativity. Our fine-grained analysis identifies mismatches between current benchmarks and usage patterns among these tasks.
Researcher Affiliation | Academia | Department of Computer Science, Stanford University, Joint Work with Carlos Guestrin. Correspondence to: Judy Hanwen Shen <EMAIL>.
Pseudocode | No | The paper describes a thematic analysis methodology qualitatively, outlining steps like filtering, clustering, and coding, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for its methodology. It only provides links to the datasets used and generated.
Open Datasets | Yes | We examine two open-access datasets, WildChat-1M (Zhao et al., 2024) and LMSYS-Chat-1M (Zheng et al., 2023)... We make the prompts and clusters available here: LMSYS: https://huggingface.co/datasets/heyyjudes/lmsys-creative-labeled WildChat: https://huggingface.co/datasets/heyyjudes/wildchat-creative-onlylabeled
Dataset Splits | No | The paper describes a thematic analysis of user prompts, including filtering and clustering, but it does not specify training/test/validation dataset splits typically used for reproducing machine learning experiments.
Hardware Specification | No | The paper describes a qualitative thematic analysis and position, and therefore does not include details on hardware specifications used for running experiments.
Software Dependencies | No | The paper mentions using 'OpenAI moderation flags' and 'Claude 3.5 Sonnet' as tools for data processing and summarization, but it does not list specific version numbers for any software libraries or frameworks required to reproduce its analysis methodology.
Experiment Setup | No | The paper describes a qualitative thematic analysis and position, and therefore does not provide specific experimental setup details such as hyperparameters or training configurations typically found in machine learning experiments.
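
The two prompt datasets linked in the Open Datasets row are hosted on the Hugging Face Hub, so they can likely be fetched programmatically. The following is a minimal sketch, not the author's tooling: it assumes the `datasets` library is installed, that both repositories are publicly readable, and that the default configuration loads without extra arguments. The helper names `repo_id_from_url` and `load_released_prompts` are hypothetical, introduced only for illustration; the column names and splits of the released data are not stated in the report, so the sketch does not assume any.

```python
from urllib.parse import urlparse

# URLs copied from the Open Datasets row of the report.
DATASET_URLS = {
    "lmsys": "https://huggingface.co/datasets/heyyjudes/lmsys-creative-labeled",
    "wildchat": "https://huggingface.co/datasets/heyyjudes/wildchat-creative-onlylabeled",
}


def repo_id_from_url(url: str) -> str:
    """Turn a Hugging Face dataset URL into its 'owner/name' repo id."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_released_prompts(name: str):
    """Download one released dataset (hypothetical helper; needs network access)."""
    # Imported lazily so the URL helper above works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(repo_id_from_url(DATASET_URLS[name]))
```

Calling `load_released_prompts("lmsys")` would download the LMSYS subset; printing the returned object is a reasonable first step to see which splits and columns are actually present before relying on any of them.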