Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CLAWS: Creativity detection for LLM-generated solutions using Attention Window of Sections

Authors: Keuntae Kim, Eunhye Jeong, Sehyeon Lee, Seohee Yoon, Yong Suk Choi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	CLAWS outperforms five existing white-box detection methods (Perplexity, Logit Entropy, Window Entropy, Hidden Score, and Attention Score) on five 7-8B math RL models (DeepSeek, Qwen, Mathstral, OpenMath2, and OREAL). We validate CLAWS on 4,545 math problems collected from 181 math contests (A(J)HSME, AMC, AIME).
Researcher Affiliation	Academia	Keuntae Kim¹, Eunhye Jeong², Sehyeon Lee³, Seohee Yoon¹, Yong Suk Choi¹; ¹Department of Computer Science, ²Department of Artificial Intelligence, ³Department of Future Mobility, Hanyang University, Seoul, Korea
Pseudocode	No	The paper does not contain any structured pseudocode or algorithm blocks. The methods are described in narrative text and mathematical formulas.
Open Source Code	Yes	Our code is available at https://github.com/kkt94/CLAWS.
Open Datasets	Yes	To conduct our study, we adopted publicly available math datasets from Creative Math [11] and HARP [27].
Dataset Splits	Yes	We generate solutions using the Generators described in Section 2.2.1, and construct the reference set and test set from the Creative Math, and the extended test set from the HARP. The reference set serves as a low-resource for detection, the test set is used for validation on the same dataset, and the extended test set is used for validation on an extended dataset. ... Reference Set: We select 29 problems from the Creative Math dataset, considering the distribution of difficulty levels. ... We generate 20 responses for each input prompt using stochastic decoding. ... Test Set: Test set consists of the remaining 371 problems with solutions from Creative Math. For each problem, three responses are generated using stochastic decoding. ... Extended Test Set: ... utilizes problems and solutions from four math competitions (A(J)HSME, AMC, and AIME) compiled in the HARP. For each problem, one response is generated.
Hardware Specification	Yes	All generations were performed in parallel on eight NVIDIA RTX A5000 GPUs (24GB VRAM).
Software Dependencies	No	The paper mentions specific LLM evaluators (Gemini-1.5-Pro, GPT-o4-mini) and LLM generators (DeepSeek-Math-7B-RL, Qwen2.5-Math-7B-Inst, Mathstral-7B, OpenMath2-Llama3.1-8B, and OREAL-7B) with their names/versions, and evaluation algorithms (XGBoost, MLP, TabM). However, it does not provide version numbers for general software dependencies such as Python, PyTorch, or CUDA libraries, which would be essential for full reproducibility.
Experiment Setup	Yes	For all LLM Generators, the maximum input token length was set to 2000, and the maximum output token length was limited to 1023. Top-p was fixed at 1.0, and Top-k was fixed at 50 across all models. Temperature values were adjusted for each model to encourage the generation of Creative Solutions, and the final settings used for dataset construction are as follows: DeepSeek-Math-7B (deepseek-ai/deepseek-math-7b-rl): 0.7, Mathstral-7B (mistralai/Mathstral-7b-v0.1): 0.25, OpenMath2-LLaMA3.1-8B (nvidia/OpenMath2-Llama3.1-8B): 1.0, OREAL-7B (internlm/OREAL-7B): 0.7, Qwen2.5-Math-7B (Qwen/Qwen2.5-Math-7B-Instruct): 0.7. ... MLP: We use a three-layer feed-forward neural network. The model is trained for 10 epochs using cross-entropy loss with class weights to account for class imbalance. Optimization is performed using Adam with a learning rate of 0.001.
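
The generation settings and split sizes quoted in the Experiment Setup and Dataset Splits rows can be collected into a small configuration sketch. This is only an illustration: the dictionary layout and the helper names `sampling_config` and `responses_per_generator` are assumptions made here, not taken from the CLAWS repository, and the numbers are copied from the rows above.

```python
# Illustrative sketch only: the structure and helper names are assumptions,
# not code from the CLAWS repository. Values come from the table rows above.

# Sampling settings shared by all generators (Experiment Setup row).
COMMON_SAMPLING = {
    "max_input_tokens": 2000,
    "max_output_tokens": 1023,
    "top_p": 1.0,
    "top_k": 50,
}

# Per-model temperatures, keyed by the model identifiers listed in the row.
TEMPERATURES = {
    "deepseek-ai/deepseek-math-7b-rl": 0.7,
    "mistralai/Mathstral-7b-v0.1": 0.25,
    "nvidia/OpenMath2-Llama3.1-8B": 1.0,
    "internlm/OREAL-7B": 0.7,
    "Qwen/Qwen2.5-Math-7B-Instruct": 0.7,
}

# Problems and responses per problem (Dataset Splits row).
SPLITS = {
    "reference": {"problems": 29, "responses_per_problem": 20},
    "test": {"problems": 371, "responses_per_problem": 3},
}


def sampling_config(model_id: str) -> dict:
    """Merge the shared settings with the temperature of one generator."""
    if model_id not in TEMPERATURES:
        raise KeyError(f"no temperature recorded for {model_id!r}")
    return {**COMMON_SAMPLING, "temperature": TEMPERATURES[model_id]}


def responses_per_generator(split: str) -> int:
    """Responses one generator produces for a split (problems x samples)."""
    spec = SPLITS[split]
    return spec["problems"] * spec["responses_per_problem"]


if __name__ == "__main__":
    for model_id in TEMPERATURES:
        print(model_id, sampling_config(model_id))
    for split in SPLITS:
        print(split, responses_per_generator(split))
```

Under the assumption that every generator is run on every problem in a split, this gives 29 × 20 = 580 reference responses and 371 × 3 = 1113 test responses per generator. The extended test set is left out because the row states only that one response is generated per problem.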