Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Self-Challenging Language Model Agents

Authors: Yifei Zhou, Sergey Levine, Jason E Weston, Xian Li, Sainbayar Sukhbaatar

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments are conducted in four different tool-use environments from M3ToolEval [37] and Tau-Bench [46] spanning tool-based calculations, web browsing, retail services, and flight booking. We apply our method to generate synthetic tasks and rely purely on these synthetic tasks to fine-tune the LLM agent before evaluating on the existing out-of-distribution test tasks. Empirically, we establish the advantages of our Self-Challenging Agent (SCA) framework in two important settings: distillation, where the goal is to distill the expertise of a stronger model to a weaker model without any existing tasks, and self-improvement, where in the absence of a stronger model the weaker model needs to supervise itself to make progress.
Researcher Affiliation | Collaboration | Yifei Zhou (UC Berkeley), Sergey Levine (UC Berkeley), Jason Weston (FAIR, Meta), Xian Li (FAIR, Meta), Sainbayar Sukhbaatar (FAIR, Meta). Work done at FAIR, Meta. Correspondence to EMAIL.
Pseudocode | No | The paper describes algorithms such as REINFORCE, DPO, PPO, and GRPO by name and provides mathematical formulations, but it does not include any structured pseudocode blocks or algorithm figures within the main text or appendices.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: N/A
Open Datasets | Yes | Our experiments are conducted in two multi-turn tool-use LLM agent benchmarks, featuring tasks from four different environments that come equipped with functional verifiers for reliable evaluations. M3ToolEval [37] is a multi-turn function-calling benchmark where the success of each task is determined by pattern-matching the agent's final answer with the reference solution. Tau-Bench [46] is a multi-turn customer service environment where the LLM agent needs to interact with a user (simulated by GPT-4o [20]), query the database, and make corresponding modifications to fulfill the user requests.
Dataset Splits | Yes | For both settings, we generate 800 synthetic tasks and 12k offline rollout trajectories. Pass@1 results are averaged over four independent trials, and pass@4 is calculated from the same four trials. We apply our method to generate synthetic tasks and rely purely on these synthetic tasks to fine-tune the LLM agent before evaluating on the existing out-of-distribution test tasks.
Hardware Specification | Yes | Table 5: Compute Usage for our main experiments in the self-improvement setting. All experiments are conducted on 8x A100 80G. The unit is the number of hours on 8x A100 80G.
Software Dependencies | No | The paper mentions various RL algorithms (REINFORCE, DPO, PPO, GRPO) and models (Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct) used, and provides their hyperparameters. However, it does not explicitly list specific software libraries or frameworks with version numbers (e.g., PyTorch 1.x, Python 3.x) that would be needed to replicate the experimental setup.
Experiment Setup | Yes | F Hyperparameters: For reproducibility, we have included the hyperparameters for different RL algorithms as used in Table 1 and Figure 3. We found that Rejection Fine-Tuning and DPO are relatively stable with respect to hyperparameter choices while online RL methods PPO and GRPO require more careful hyperparameter tuning. Table 6: Hyperparameters for different RL algorithms.
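
The Dataset Splits entry says that pass@1 is averaged over four independent trials and that pass@4 is calculated from the same four trials. The sketch below shows one common way to compute both from per-task trial outcomes. It is an illustration under stated assumptions, not the paper's evaluation code: the function names and the example outcomes are hypothetical, and pass@4 is assumed here to count a task as solved if at least one of its four trials succeeds, which the excerpt does not spell out.

```python
from statistics import mean

def pass_at_1(outcomes):
    """Mean per-trial success rate, averaged over tasks.

    outcomes: one list per task, each holding one boolean per trial.
    """
    return mean(mean(1.0 if ok else 0.0 for ok in trials) for trials in outcomes)

def pass_at_k_any(outcomes):
    """Fraction of tasks solved in at least one trial.

    Assumption: with k trials per task, pass@k counts a task as solved
    when any of its k trials succeeds.
    """
    return mean(1.0 if any(trials) else 0.0 for trials in outcomes)

# Hypothetical outcomes: three tasks, four independent trials each.
example_outcomes = [
    [True, False, True, True],
    [False, False, False, False],
    [False, True, False, False],
]

print(f"pass@1 = {pass_at_1(example_outcomes):.3f}")
print(f"pass@4 = {pass_at_k_any(example_outcomes):.3f}")
```

Under this reading, pass@4 can never be lower than pass@1 on the same trials, since a task with any successful trial contributes fully to pass@4 but only fractionally to pass@1.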