Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

RF-Agent: Automated Reward Function Design via Language Agent Tree Search

Authors: Ning Gao, Xiuhui Zhang, Xingyu Jiang, Mukang You, Mohan Zhang, Yue Deng

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Outstanding experimental results in 17 diverse low-level control tasks demonstrate the effectiveness of our method. The source code is available at https://github.com/deng-ai-lab/RF-Agent.
Researcher Affiliation	Academia	Ning Gao, Xiuhui Zhang, Xingyu Jiang, Mukang You, Mohan Zhang, Yue Deng; Beihang University, 37 Xueyuan Road, Haidian District, Beijing
Pseudocode	Yes	Algorithm 1 provides a pseudo-code for the proposed RF-Agent method, which can be combined with Fig.2 to further familiarize yourself with the entire RF-Agent process.
Open Source Code	Yes	The source code is available at https://github.com/deng-ai-lab/RF-Agent.
Open Datasets	Yes	We test RF-Agent in two low-level control environments: Isaac Gym[31] and Bi-Dex Hands[9], encompassing 8 control agents and 17 diverse tasks.
Dataset Splits	No	The paper uses simulation environments (Isaac Gym, Bi-Dex Hands) where data is generated through interaction, not traditional static datasets with predefined splits. There is no explicit mention of training/test/validation splits for these environments' data.
Hardware Specification	Yes	We deployed RF-Agent on a 4 Nvidia Geforce RTX3090 cards with 128 core CPUs and 256GiB memory server.
Software Dependencies	No	The paper mentions using 'GPT-4o-mini-0718 and GPT-4o-0806 models' which are LLM models, and 'well-tuned PPO[30]' with 'rl-games[30]' as the learning algorithm, but it does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA, which are typically required for reproducibility.
Experiment Setup	Yes	In Isaac Gym, we set the total limit to 80 and use the GPT-4o-mini-0718 and GPT-4o-0806 models for LLM implementation[21]. For the Bidex tasks, to evaluate the search performance of the method itself under complex control tasks, we increase the upper limit to 512 and use only the GPT-4o-mini model. ... In our RF-Agent, five different action types are set in the expansion stage, as well as the initial initialization action... Thus, we configure the actions as [2, 2, 2, 1, 1] per expansion to make the ratio of nodes that utilize local and global information at 1 : 1. The initial value of λ is set to 0.4 across all tasks, v_self is constrained within the range of [-1, 1] in order to have a more obvious distinction after softmax, k is randomly sampled from [2, 4] to control the number of nodes sampled in ac3, ar4 and ad5, and η is set to 0.7.
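
The settings quoted in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the class name `RFAgentSetup`, all field names, and the helper `sample_k` are hypothetical, and it assumes that k is an integer drawn uniformly from the inclusive range [2, 4], which the quoted text does not state. The unit of the "total limit" is not given in the quote, so it is stored as a plain integer.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RFAgentSetup:
    """Hypothetical container for the hyperparameters quoted in the table."""

    total_limit: int                 # 80 for Isaac Gym, 512 for Bidex (unit not stated)
    llm_models: tuple                # model names as quoted in the setup row
    actions_per_expansion: tuple = (2, 2, 2, 1, 1)  # one count per action type
    lambda_init: float = 0.4         # initial value of lambda, all tasks
    eta: float = 0.7
    k_range: tuple = (2, 4)          # assumed inclusive integer bounds for k

    def sample_k(self, rng: random.Random) -> int:
        """Draw k uniformly from k_range (assumption: integer, inclusive)."""
        low, high = self.k_range
        return rng.randint(low, high)


# The two settings described in the row.
ISAAC_GYM = RFAgentSetup(
    total_limit=80,
    llm_models=("GPT-4o-mini-0718", "GPT-4o-0806"),
)
BIDEX = RFAgentSetup(
    total_limit=512,
    llm_models=("GPT-4o-mini",),
)
```

Keeping the shared values as defaults makes the two settings differ only in the search limit and the model list, which matches how the row describes them.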