Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement

Authors: Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, Deqing Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments in three interactive environments show that CGI outperforms existing baselines by a substantial margin. Notably, even a small critic model surpasses GPT-4 in feedback quality. The resulting actor achieves state-of-the-art performance, demonstrating the power of explicit guidance to enhance decision-making in LLM-based agents.
Researcher Affiliation Collaboration Ruihan Yang¹, Fanghua Ye, Jian Li, Siyu Yuan¹, Yikai Zhang², Zhaopeng Tu, Xiaolong Li, Deqing Yang¹. ¹School of Data Science, Fudan University; Tencent Hunyuan; ²College of Computer Science and Artificial Intelligence, Fudan University. EMAIL EMAIL EMAIL
Pseudocode Yes Algorithm 1 summarizes the CGI framework (see Appendix C for definitions of all notations).
Open Source Code Yes Project Page: https://github.com/rhyang2021/CGI
Open Datasets Yes WebShop [11], which is an interactive web environment for online shopping. ScienceWorld [25], which is a text-based scientific environment designed to evaluate agents' scientific reasoning abilities. TextCraft [26], which is a text-based environment to create Minecraft items. Following AgentTuning [51], we incorporate general datasets such as ShareGPT to improve generalization. The training data for the critic model consists of expert critiques generated by the expert critic (i.e., GPT-4o) in the ScienceWorld, WebShop, and TextCraft environments, as described in Section 4.2. The specific training set sizes during the SFT phases for Llama-3-8B-Instruct are 14K from ScienceWorld, 10K from WebShop, and 8K from TextCraft.
Dataset Splits Yes We evaluate our model on the test sets for these three environments (200 simulations for ScienceWorld and WebShop, 100 for TextCraft). To collect training data, we randomly sample 500 simulations from WebShop, 350 from ScienceWorld, and 374 from TextCraft. The dataset size for each iteration of Llama-3-8B-Instruct is detailed in Table 4.
Hardware Specification Yes We ran SFT experiments using 8 NVIDIA A100-40GB GPUs.
Software Dependencies No The paper mentions the "LlamaFactory code base" and the "Llama-3-8B-Instruct" model as the backbone, and "Adam [58]" as the optimizer. However, it does not specify explicit version numbers for programming languages (e.g., Python), libraries (e.g., PyTorch), or other software components beyond the model and optimizer.
Experiment Setup Yes Table 5: Fine-tuning hyper-parameters for Critique Generation and Action Refinement stage. Configuration: Model Llama-3-8B-Instruct, Number of epochs 3, Total Batch size 64 samples, Optimizer Adam [58] (β1 = 0.9, β2 = 0.98, ϵ = 1 × 10⁻⁸), Learning rate 2 × 10⁻⁵, Warmup Ratio 0.05, Cutoff Length 4096.
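
To make the Experiment Setup row concrete, the sketch below records the Table 5 values in a small Python config, taking ϵ as 1e-8 and the learning rate as 2e-5, and works out what a 0.05 warmup ratio implies in optimizer steps. The class name `SFTConfig`, its field names, and the `schedule` helper are hypothetical and do not come from the paper or its code base. The example dataset sizes are the approximate 14K, 10K, and 8K SFT set sizes quoted in the Open Datasets row, treated here as separate runs, which is an assumption. The rounding (keeping the last partial batch and rounding warmup up) is also an assumption; a given training framework may round differently.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTConfig:
    """Table 5 values as quoted in this report; field names are hypothetical."""
    model: str = "Llama-3-8B-Instruct"
    num_epochs: int = 3
    total_batch_size: int = 64   # samples per optimizer step
    adam_beta1: float = 0.9
    adam_beta2: float = 0.98
    adam_epsilon: float = 1e-8
    learning_rate: float = 2e-5
    warmup_ratio: float = 0.05
    cutoff_length: int = 4096    # maximum sequence length in tokens


def schedule(num_samples: int, cfg: SFTConfig) -> tuple[int, int]:
    """Return (total optimizer steps, warmup steps) for a dataset size.

    Assumes the last partial batch of each epoch is kept and that the
    warmup step count is rounded up.
    """
    steps_per_epoch = math.ceil(num_samples / cfg.total_batch_size)
    total_steps = steps_per_epoch * cfg.num_epochs
    warmup_steps = math.ceil(total_steps * cfg.warmup_ratio)
    return total_steps, warmup_steps


if __name__ == "__main__":
    cfg = SFTConfig()
    # Approximate SFT set sizes (14K, 10K, 8K) read as 14000, 10000, 8000.
    for size in (14_000, 10_000, 8_000):
        total, warmup = schedule(size, cfg)
        print(f"{size} samples: {total} steps, {warmup} warmup steps")
```

Under these assumptions, a 14,000-sample run takes 657 optimizer steps, of which the first 33 are warmup.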