Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Large Language Models Think Too Fast To Explore Effectively

Authors: Lan Pan, Hanbo Xie, Robert Wilson

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones. Results show most LLMs underperform compared to humans, except for the o1 model, with those traditional LLMs relying primarily on uncertainty-driven strategies, unlike humans who balance uncertainty and empowerment. ... We evaluated the performance of five LLMs: gpt-4o-2024-08-06 (GPT-4o [26]), o1-2024-12-17 (o1 [27]), Meta-Llama-3.1-8B-Instruct (LLaMA3.1-8B [22]), Meta-Llama-3.1-70B-Instruct (LLaMA3.1-70B [22]), and DeepSeek-R1 [11].
Researcher Affiliation	Academia	Lan Pan School of Psychology Georgia Institute of Technology Atlanta, GA, USA EMAIL Hanbo Xie School of Psychology Georgia Institute of Technology Atlanta, GA, USA EMAIL Robert C. Wilson School of Psychology Georgia Institute of Technology Atlanta, GA, USA EMAIL
Pseudocode	No	The paper describes methods and processes in text, such as the empowerment and uncertainty formulas and the SAE reconstruction loss minimization, but does not present any formal pseudocode or algorithm blocks.
Open Source Code	Yes	All custom code used for data preprocessing, LLM experiments, and regression analysis, and Sparse Autoencoder (SAE) training is available at https://github.com/Louanna1208/LLMs-Exploration.
Open Datasets	Yes	Data from 29,493 human participants across 4,691,033 trials establish the benchmark. The players were instructed in the rules of the game and tasked with discovering new elements. ... The LLM experimental data generated in this study are available at OSF repository (view-only link). Third-party data of human participants playing the original Little Alchemy 2 game may be shared upon reasonable request to the (EMAIL).
Dataset Splits	No	The paper states that human data consists of 29,493 participants and LLMs were tested with 500 trials each, but it does not specify how this human data is split into training/testing/validation sets, nor does it describe splits for the LLM experiments beyond the number of trials.
Hardware Specification	Yes	LLaMA3.1-70B and LLaMA3.1-8B are launched in computing cluster, with an NVIDIA A100-80GB running 13 hours and 2 hours to complete all the experiments (with all the settings and repetitions). ... Training and analyzing SAEs across all the layers takes 42 hours with an NVIDIA V100 GPU.
Software Dependencies	No	The paper mentions using GPT-4.1 as an automated classifier and libraries like PyTorch implicitly for SAEs, but it does not provide specific version numbers for these software components or any other key dependencies.
Experiment Setup	Yes	To investigate the impact of randomness on exploration, we varied the sampling temperature across four settings: 0.0, 0.3, 0.7, and 1.0 (o1 is not available to set parameters and defaults as 1), and under each temperature, there are five repetitions for running the experiment. ... We set the hidden size of latent as 8192, the same as the dimensions of the model embeddings. We set the learning rate as 1e-4, with a batch size of 256. The L2 norm is only 1e-6.
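
The experiment setup quoted above can be read as a small run grid plus a set of SAE hyperparameters. The sketch below enumerates that grid under stated assumptions; it is not taken from the authors' repository. The names `Run`, `build_runs`, `FIXED_TEMPERATURE_MODELS`, and `SAE_CONFIG` are hypothetical. It also assumes that o1 was run for five repetitions at its default temperature of 1, that every other model was run at all four temperatures, and that "L2 norm" refers to a regularization coefficient.

```python
from dataclasses import dataclass
from itertools import product

# Values quoted in the Experiment Setup row.
TEMPERATURES = (0.0, 0.3, 0.7, 1.0)
REPETITIONS = 5

# Model identifiers quoted in the Research Type row.
MODELS = (
    "gpt-4o-2024-08-06",
    "o1-2024-12-17",
    "Meta-Llama-3.1-8B-Instruct",
    "Meta-Llama-3.1-70B-Instruct",
    "DeepSeek-R1",
)

# o1 does not expose the temperature parameter and defaults to 1.
FIXED_TEMPERATURE_MODELS = {"o1-2024-12-17": 1.0}


@dataclass(frozen=True)
class Run:
    """One experimental run (hypothetical container, for illustration)."""
    model: str
    temperature: float
    repetition: int


def build_runs(models=MODELS):
    """Enumerate (model, temperature, repetition) combinations.

    Assumption: models with a fixed temperature still get REPETITIONS runs.
    """
    runs = []
    for model in models:
        if model in FIXED_TEMPERATURE_MODELS:
            temps = (FIXED_TEMPERATURE_MODELS[model],)
        else:
            temps = TEMPERATURES
        for temperature, repetition in product(temps, range(REPETITIONS)):
            runs.append(Run(model, temperature, repetition))
    return runs


# SAE hyperparameters as quoted; the key names are illustrative, and reading
# "L2 norm" as a regularization coefficient is an assumption.
SAE_CONFIG = {
    "latent_size": 8192,
    "learning_rate": 1e-4,
    "batch_size": 256,
    "l2_coefficient": 1e-6,
}

if __name__ == "__main__":
    runs = build_runs()
    print(f"{len(runs)} runs in total")
```

Under these assumptions, each model with an adjustable temperature contributes 4 × 5 = 20 runs and o1 contributes 5. The totals actually used in the paper may differ.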