Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement

Authors: Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, Deqing Yang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments in three interactive environments show that CGI outperforms existing baselines by a substantial margin. Notably, even a small critic model surpasses GPT-4 in feedback quality. The resulting actor achieves state-of-the-art performance, demonstrating the power of explicit guidance to enhance decision-making in LLM-based agents.
Researcher Affiliation	Collaboration	Ruihan Yang1, Fanghua Ye, Jian Li, Siyu Yuan1, Yikai Zhang2, Zhaopeng Tu, Xiaolong Li, Deqing Yang1. 1: School of Data Science, Fudan University; Tencent Hunyuan; 2: College of Computer Science and Artificial Intelligence, Fudan University.
Pseudocode	Yes	Algorithm 1 summarizes the CGI framework (see Appendix C for definitions of all notations).
Open Source Code	Yes	Project Page: https://github.com/rhyang2021/CGI
Open Datasets	Yes	WebShop [11], which is an interactive web environment for online shopping. ScienceWorld [25], which is a text-based scientific environment designed to evaluate agents' scientific reasoning abilities. TextCraft [26], which is a text-based environment to create Minecraft items. Following AgentTuning [51], we incorporate general datasets such as ShareGPT to improve generalization. The training data for the critic model consists of expert critiques generated by the expert critic (i.e., GPT-4o) in the ScienceWorld, WebShop, and TextCraft environments, as described in Section 4.2. The specific training set sizes during the SFT phases for Llama-3-8B-Instruct are 14K from ScienceWorld, 10K from WebShop, and 8K from TextCraft.
Dataset Splits	Yes	We evaluate our model on the test sets for these three environments (200 simulations for ScienceWorld and WebShop, 100 for TextCraft). To collect training data, we randomly sample 500 simulations from WebShop, 350 from ScienceWorld, and 374 from TextCraft. The dataset size for each iteration of Llama-3-8B-Instruct is detailed in Table 4.
Hardware Specification	Yes	We ran SFT experiments using 8 NVIDIA A100-40GB GPUs.
Software Dependencies	No	The paper mentions the "LlamaFactory code base", the "Llama-3-8B-Instruct" model as the backbone, and "Adam [58]" as the optimizer. However, it does not specify explicit version numbers for programming languages (e.g., Python), libraries (e.g., PyTorch), or other software components beyond the model and optimizer.
Experiment Setup	Yes	Table 5: Fine-tuning hyper-parameters for the Critique Generation and Action Refinement stages. Configuration: Model Llama-3-8B-Instruct, Number of epochs 3, Total batch size 64 samples, Optimizer Adam [58] (β1 = 0.9, β2 = 0.98, ϵ = 1 × 10⁻⁸), Learning rate 2 × 10⁻⁵, Warmup ratio 0.05, Cutoff length 4096.
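
To make the reported setup easier to check, the following is a minimal Python sketch that transcribes the Table 5 hyper-parameters into a config object and derives a rough step schedule from them. It is not the authors' training code. The class `FineTuneConfig` and the function `schedule_summary` are hypothetical names introduced here for illustration. The sample count assumes the reported "14K", "10K", and "8K" training set sizes mean exactly 14,000, 10,000, and 8,000 examples and that no other data (such as ShareGPT) is mixed in; rounding the warmup step count up is also an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyper-parameters transcribed from the Experiment Setup row (Table 5)."""
    model: str = "Llama-3-8B-Instruct"
    num_epochs: int = 3
    total_batch_size: int = 64
    adam_beta1: float = 0.9
    adam_beta2: float = 0.98
    adam_epsilon: float = 1e-8
    learning_rate: float = 2e-5
    warmup_ratio: float = 0.05
    cutoff_length: int = 4096


def schedule_summary(cfg: FineTuneConfig, num_samples: int) -> dict:
    """Derive optimizer step counts for a given number of training samples."""
    # One optimizer step consumes one total batch; a partial last batch still counts.
    steps_per_epoch = math.ceil(num_samples / cfg.total_batch_size)
    total_steps = steps_per_epoch * cfg.num_epochs
    # Assumption: the warmup length is the warmup ratio times total steps, rounded up.
    warmup_steps = math.ceil(total_steps * cfg.warmup_ratio)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


if __name__ == "__main__":
    cfg = FineTuneConfig()
    # Assumption: "14K + 10K + 8K" read as exact counts, with no extra data mixed in.
    assumed_samples = 14_000 + 10_000 + 8_000
    print(schedule_summary(cfg, assumed_samples))
```

Under these assumptions the combined set is 32,000 samples, so the figures are only an order-of-magnitude check; the per-iteration dataset sizes in the paper's Table 4 would give the actual counts.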