Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Large Language Models Think Too Fast To Explore Effectively
Authors: Lan Pan, Hanbo Xie, Robert Wilson
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones. Results show most LLMs underperform compared to humans, except for the o1 model, with those traditional LLMs relying primarily on uncertainty-driven strategies, unlike humans who balance uncertainty and empowerment. ... We evaluated the performance of five LLMs: gpt-4o-2024-08-06 (GPT-4o [26]), o1-2024-12-17 (o1 [27]), Meta-Llama-3.1-8B-Instruct (LLaMA3.1-8B [22]), Meta-Llama-3.1-70B-Instruct (LLaMA3.1-70B [22]), and DeepSeek-R1 [11]. |
| Researcher Affiliation | Academia | Lan Pan School of Psychology Georgia Institute of Technology Atlanta, GA, USA EMAIL Hanbo Xie School of Psychology Georgia Institute of Technology Atlanta, GA, USA EMAIL Robert C. Wilson School of Psychology Georgia Institute of Technology Atlanta, GA, USA EMAIL |
| Pseudocode | No | The paper describes methods and processes in text, such as the empowerment and uncertainty formulas and the SAE reconstruction loss minimization, but does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | All custom code used for data preprocessing, LLM experiments, and regression analysis, and Sparse Autoencoder (SAE) training is available at https://github.com/Louanna1208/LLMs-Exploration. |
| Open Datasets | Yes | Data from 29,493 human participants across 4,691,033 trials establish the benchmark. The players were instructed in the rules of the game and tasked with discovering new elements. ... The LLM experimental data generated in this study are available at OSF repository (view-only link). Third-party data of human participants playing the original Little Alchemy 2 game may be shared upon reasonable request to the (EMAIL). |
| Dataset Splits | No | The paper states that human data consists of 29,493 participants and LLMs were tested with 500 trials each, but it does not specify how this human data is split into training/testing/validation sets, nor does it describe splits for the LLM experiments beyond the number of trials. |
| Hardware Specification | Yes | LLaMA3.1-70B and LLaMA3.1-8B are launched in computing cluster, with an NVIDIA A100-80GB running 13 hours and 2 hours to complete all the experiments (with all the settings and repetitions). ... Training and analyzing SAEs across all the layers takes 42 hours with an NVIDIA V100 GPU. |
| Software Dependencies | No | The paper mentions using GPT-4.1 as an automated classifier and libraries like PyTorch implicitly for SAEs, but it does not provide specific version numbers for these software components or any other key dependencies. |
| Experiment Setup | Yes | To investigate the impact of randomness on exploration, we varied the sampling temperature across four settings: 0.0, 0.3, 0.7, and 1.0 (o1 is not available to set parameters and defaults as 1), and under each temperature, there are five repetitions for running the experiment. ... We set the hidden size of latent as 8192, the same as the dimensions of the model embeddings. We set the learning rate as 1e-4, with a batch size of 256. The L2 norm is only 1e-6. |
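
The Experiment Setup row can be read as a small run grid plus a set of SAE hyperparameters. The sketch below is only an illustration of that reading, not the authors' code: the names `build_run_grid`, `FIXED_TEMPERATURE`, and `SAE_CONFIG` are hypothetical. It also assumes that o1 is run five times at its default temperature of 1, that every other model (including DeepSeek-R1) uses all four temperatures, and that the quoted "L2 norm" of 1e-6 is a penalty coefficient.

```python
from itertools import product

# Values quoted in the Experiment Setup row.
TEMPERATURES = [0.0, 0.3, 0.7, 1.0]
REPETITIONS = 5

# Model identifiers as listed in the Research Type row.
MODELS = [
    "gpt-4o-2024-08-06",
    "o1-2024-12-17",
    "Meta-Llama-3.1-8B-Instruct",
    "Meta-Llama-3.1-70B-Instruct",
    "DeepSeek-R1",
]

# o1 does not expose a temperature parameter and defaults to 1.
# Assumption: it is still repeated five times at that single setting.
FIXED_TEMPERATURE = {"o1-2024-12-17": 1.0}


def build_run_grid():
    """Enumerate one record per (model, temperature, repetition)."""
    runs = []
    for model in MODELS:
        if model in FIXED_TEMPERATURE:
            temperatures = [FIXED_TEMPERATURE[model]]
        else:
            temperatures = TEMPERATURES
        for temperature, repetition in product(temperatures, range(REPETITIONS)):
            runs.append(
                {"model": model, "temperature": temperature, "repetition": repetition}
            )
    return runs


# SAE hyperparameters as quoted; the key names are illustrative.
SAE_CONFIG = {
    "latent_size": 8192,       # same as the model embedding dimension
    "learning_rate": 1e-4,
    "batch_size": 256,
    "l2_coefficient": 1e-6,    # assumption: the quoted "L2 norm" is a penalty weight
}

if __name__ == "__main__":
    grid = build_run_grid()
    print(f"{len(grid)} runs planned")
```

Under these assumptions, the grid makes the total number of runs explicit, which is useful when checking a replication against the reported settings.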