Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Position: Societal Impacts Research Requires Benchmarks for Creative Composition Tasks
Authors: Judy Hanwen Shen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a thematic analysis using 2 million language model user prompts, we identify creative composition tasks as a prevalent usage category where users seek help with personal tasks that require everyday creativity. Our fine-grained analysis identifies mismatches between current benchmarks and usage patterns among these tasks. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, Stanford University. Joint work with Carlos Guestrin. Correspondence to: Judy Hanwen Shen <EMAIL>. |
| Pseudocode | No | The paper describes a thematic analysis methodology qualitatively, outlining steps like filtering, clustering, and coding, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for its methodology. It only provides links to the datasets used and generated. |
| Open Datasets | Yes | We examine two open-access datasets, WildChat-1M (Zhao et al., 2024) and LMSYS-Chat-1M (Zheng et al., 2023)... We make the prompts and clusters available here: LMSYS: https://huggingface.co/datasets/heyyjudes/lmsys-creative-labeled WildChat: https://huggingface.co/datasets/heyyjudes/wildchat-creative-onlylabeled |
| Dataset Splits | No | The paper describes a thematic analysis of user prompts, including filtering and clustering, but it does not specify training/test/validation dataset splits typically used for reproducing machine learning experiments. |
| Hardware Specification | No | The paper describes a qualitative thematic analysis and position, and therefore does not include details on hardware specifications used for running experiments. |
| Software Dependencies | No | The paper mentions using 'OpenAI moderation flags' and 'Claude 3.5 Sonnet' as tools for data processing and summarization, but it does not list specific version numbers for any software libraries or frameworks required to reproduce its analysis methodology. |
| Experiment Setup | No | The paper describes a qualitative thematic analysis and position, and therefore does not provide specific experimental setup details such as hyperparameters or training configurations typically found in machine learning experiments. |
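
The released prompt sets listed under Open Datasets can likely be fetched with the Hugging Face `datasets` library. The following is a minimal sketch, not something described in the paper: the repository IDs are copied from the table, while the helper names `repo_url` and `load_labeled_prompts` are hypothetical. It assumes the repositories are publicly readable, and it makes no assumption about split or column names, which should be inspected after loading.

```python
# Repository IDs copied from the "Open Datasets" row of the table.
LABELED_PROMPT_REPOS = {
    "lmsys": "heyyjudes/lmsys-creative-labeled",
    "wildchat": "heyyjudes/wildchat-creative-onlylabeled",
}


def repo_url(source: str) -> str:
    """Return the Hugging Face dataset page URL for a source key."""
    return f"https://huggingface.co/datasets/{LABELED_PROMPT_REPOS[source]}"


def load_labeled_prompts(source: str):
    """Download one labeled prompt set (hypothetical helper).

    Requires network access and the `datasets` package.
    """
    # Imported lazily so the constants above work without the dependency.
    from datasets import load_dataset

    # No split or column names are assumed here; inspect the returned object.
    return load_dataset(LABELED_PROMPT_REPOS[source])


if __name__ == "__main__":
    for key in LABELED_PROMPT_REPOS:
        print(key, repo_url(key))
```

Calling `load_labeled_prompts("wildchat")` and printing the result should show the available splits and fields, which is a reasonable first step before attempting to reproduce the clustering or coding stages.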