Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

How to Train Your LLM Web Agent: A Statistical Diagnosis

Authors: Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Thibault de Chezelles, Megh Thakkar, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Steve (Xue) Liu, Alexandre Drouin, Alexandre Piche, Alexandre Lacoste, Massimo Caccia

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via SFT, followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices in settings where exhaustive sweeps are impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWoB++. Further, this strategy only requires 55% of the compute to match the peak of pure SFT on MiniWoB++, pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
Researcher Affiliation | Collaboration | 1 ServiceNow Research, 2 Mila Quebec AI Institute, 3 Polytechnique Montréal, 4 HEC Montréal, 5 McGill University, 6 Université de Montréal
Pseudocode | Yes | Algorithm 1: Bootstrap Estimation of Hyperparameter Importance
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Current code tied with infrastructure which is difficult to open source.
Open Datasets | Yes | Our experiments focus on two benchmarks. The first is MiniWoB++, a suite of 30 medium-horizon web interaction tasks... The second is WorkArena [10], a more challenging benchmark of 33 enterprise knowledge-work tasks... These benchmarks provide a representative spectrum of sequential decision-making challenges faced by interactive LLM agents. Both benchmarks are depicted in Figure 7.
Dataset Splits | Yes | We evaluate generalization by training only on the train split and reporting performance on held-out tasks from the test split. For the held-out goals metric, we instantiate goal variations using seed ranges [0, 2] [8] for training tasks (3 goals per task) and [0, 9] [8] for test tasks (10 goals per task). Below we list the exact task identifiers used in our experiments for both benchmarks. Task names are the registry keys from the respective environments.
Hardware Specification | Yes | Our computational infrastructure comprises 8 H100-80GB GPUs for expert data generation with the 70B model. For student model training, we allocate 2 H100 GPUs for MiniWoB++ experiments and 4 H100 GPUs for WorkArena experiments, reflecting the increased complexity of the latter.
Software Dependencies | No | To manage the training pipeline, we use BROWSERGYM [8] for orchestrating Chromium-based web environments and structuring the agent's action space, while AGENTLAB [8] handles agent design. Model fine-tuning is conducted with TORCHTUNE [23], utilizing Fully Sharded Data Parallelism (FSDP) to enable scalable training across multiple GPUs.
Experiment Setup | Yes | We conduct a random hyperparameter sweep over 1,370 training runs over the following parameter configurations: Decoding temperature (ρ_LLM): sampled from {0.1, 0.25, 0.5, 0.75, 1}; Curriculum learning: enabled or disabled (True, False); Curriculum mean (µ_target): {0.25, 0.5, 0.75}; Curriculum temperature (ρ_Curr): {0.1, 0.3}; Discount rate: {0.5, 0.8, 0.9, 0.95, 0.98, 1.0}; Grouped-relative advantage: enabled or disabled; Zero-advantage filtering: enabled or disabled; Standard-deviation normalized advantage: enabled or disabled; Effective batch size: {64, 256, 512, 1024}; Learning rate: {1e-6, 5e-6, 5e-7}; Error log feedback: enabled or disabled; Importance ratio: enabled or disabled.
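To make the size of the search space in the Experiment Setup row concrete, the following is a minimal sketch of a random sweep over the listed values. It is not the authors' code, which is not released. The dictionary keys and function names are hypothetical, and the sketch assumes each hyperparameter is drawn uniformly and independently, which the source does not state (it only says the sweep is random).

```python
import random

# Values copied from the Experiment Setup row.
# Key names are hypothetical; the paper does not give identifiers.
SEARCH_SPACE = {
    "decoding_temperature": [0.1, 0.25, 0.5, 0.75, 1.0],
    "curriculum_learning": [True, False],
    "curriculum_mean": [0.25, 0.5, 0.75],
    "curriculum_temperature": [0.1, 0.3],
    "discount_rate": [0.5, 0.8, 0.9, 0.95, 0.98, 1.0],
    "grouped_relative_advantage": [True, False],
    "zero_advantage_filtering": [True, False],
    "std_normalized_advantage": [True, False],
    "effective_batch_size": [64, 256, 512, 1024],
    "learning_rate": [1e-6, 5e-6, 5e-7],
    "error_log_feedback": [True, False],
    "importance_ratio": [True, False],
}


def sample_config(rng):
    """Draw one configuration.

    Assumption: every value is picked uniformly and independently
    of the other hyperparameters.
    """
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def sample_sweep(n_runs, seed=0):
    """Draw n_runs configurations with a seeded, reproducible generator."""
    rng = random.Random(seed)
    return [sample_config(rng) for _ in range(n_runs)]


def grid_size():
    """Number of combinations in the full Cartesian grid."""
    size = 1
    for values in SEARCH_SPACE.values():
        size *= len(values)
    return size


if __name__ == "__main__":
    sweep = sample_sweep(1370)
    print(f"{len(sweep)} sampled runs out of {grid_size()} grid points")
```

Under these assumptions the full grid holds 138,240 combinations, so 1,370 runs cover roughly 1% of it. This is consistent with the paper's remark that exhaustive sweeps are impractical.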