Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Language Models can Self-Improve at State-Value Estimation for Better Search

Authors: Ethan Mendes, Alan Ritter

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, STL-trained value models built on moderately sized (8B parameter) open-weight LLMs boost web agent success rates by over 39%, achieving comparable performance with proprietary models. STL also generalizes to multi-hop QA and math puzzles. We find that STL enables small open-source models to guide efficient search, reducing inference costs by integrating explicit reasoning with value learning. [...] 4 Experiments We benchmark our proposed STL self-improvement approach on applied web agent tasks, multi-step question answering, and math puzzle tasks.
Researcher Affiliation | Academia | Ethan Mendes, Alan Ritter; Georgia Institute of Technology; EMAIL, EMAIL
Pseudocode | Yes | See Figure 2 and Algorithm 1 in Appendix C for an overview of the method. [...] C Algorithms The STL algorithm is presented in full in Algorithm 1. For information about the task_specific_filter, see Appendix D.3 and Appendix F.4. Additionally, the algorithm for greedy search as used in § 4 is presented in Algorithm 2.
Open Source Code | Yes | Our code is available at https://github.com/ethanm88/self-taught-lookahead.
Open Datasets | Yes | To benchmark our STL method on web tasks, we utilize WebShop [63] [...] We specifically utilize the HotpotQA [62] benchmark [...] Finally, we also study the performance of STL on the Game-of-24 task [64]
Dataset Splits | Yes | We present results on both the full WebShop test set and on the mini test set of 50 tasks used by [74] [...] we roll out 500 tasks from the training dataset [...] We evaluate on a set of 500 unseen questions and also a smaller set of 50 examples [...] Figure 3 shows the performance of evaluated methods on a set of 50 tasks seen during STL and a set of 50 more challenging (determined by lower human solve percentages), unseen tasks.
Hardware Specification | Yes | Finetuning with a single A40 GPU takes 4.5 hours for the WebShop task. [...] Fine-tuning the value model for STL is carried out on a single NVIDIA A40 GPU.
Software Dependencies | No | We use LoRA finetuning [25] and use models provided by unsloth. The hyperparameters used are in Table 8. [...] Additionally, we serve base and fine-tuned models using vLLM [33] for efficient value estimation of new states during search.
Experiment Setup | Yes | Following LATS for a fair comparison, we use a branching factor of 5 for all methods and 30 iterations for MCTS-based approaches. [...] We limit trajectories to five steps and train four value models depths 1 to 4, only allowing a terminating BUY action on the final step. [...] We use temperature = 1.0 and max_tokens = 3192. [...] Table 8: Hyperparameters during STL training (STL row): warmup-steps = 10, learning-rate = 2e-4, weight-decay = 0.01, per-device-batch-size = 8, lora-r = 16, lora-alpha = 16.
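
For readers who want the reported setup in one place, the sketch below gathers the quoted search and training settings into two Python dataclasses and pairs them with a generic value-guided greedy search. It is an illustration only, not the authors' Algorithm 2 or their repository code. The names `SearchConfig`, `TrainConfig`, `greedy_search`, `expand`, and `value` are hypothetical, and the learning rate is read as 2e-4 on the assumption that the minus sign was lost when Table 8 was extracted.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

S = TypeVar("S")  # a search state; its concrete type is left open


@dataclass(frozen=True)
class SearchConfig:
    """Search settings quoted in the Experiment Setup row."""
    branching_factor: int = 5   # reported for all methods
    mcts_iterations: int = 30   # reported for MCTS-based approaches only
    max_steps: int = 5          # trajectory limit reported for the web task
    temperature: float = 1.0
    max_tokens: int = 3192


@dataclass(frozen=True)
class TrainConfig:
    """Table 8 values for the STL row."""
    warmup_steps: int = 10
    learning_rate: float = 2e-4  # assumption: source text reads "2e 4"
    weight_decay: float = 0.01
    per_device_batch_size: int = 8
    lora_r: int = 16
    lora_alpha: int = 16


def greedy_search(
    start: S,
    expand: Callable[[S], List[S]],
    value: Callable[[S], float],
    cfg: SearchConfig,
) -> List[S]:
    """Generic greedy search: at each step keep the highest-valued successor.

    `expand` and `value` are hypothetical stand-ins for an action proposer
    and a learned value model; they are not taken from the paper's code.
    """
    state = start
    path = [state]
    for _ in range(cfg.max_steps):
        # Consider at most `branching_factor` successors per step.
        candidates = expand(state)[: cfg.branching_factor]
        if not candidates:
            break
        state = max(candidates, key=value)
        path.append(state)
    return path


if __name__ == "__main__":
    # Toy problem: walk from 1 toward 10 using +1, *2, and -1 moves.
    cfg = SearchConfig()
    toy_path = greedy_search(
        1,
        expand=lambda n: [n + 1, n * 2, n - 1],
        value=lambda n: -abs(10 - n),
        cfg=cfg,
    )
    print(toy_path)
```

In a real run, `value` would presumably wrap a call to the served value model and `expand` would sample candidate actions from the policy; both are kept as plain callables here so the sketch runs without any model.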