Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

RLZero: Direct Policy Inference from Language Without In-Domain Supervision

Authors: Harshit Sushil Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments seek to understand the quality of behaviors that RLZero is able to produce in two settings (a) Language to Behavior, and (b) Cross-embodied Video to Behavior Generation. We note that none of the experiments below assume in-domain supervision such as annotations of trajectories in the environment with their task label, or expert demonstrations corresponding to the specified tasks either for training the video generation model or training the behavioral foundation model. ... Table 1 demonstrates the win rates by different methods when evaluated by GPT-4o-preview as well as distribution matching returns.
Researcher Affiliation Collaboration Harshit Sikchi (1), Siddhant Agarwal (1), Pranaya Jajoo (2), Samyak Parajuli (1), Caleb Chuck (1), Max Rudolph (1), Peter Stone (1, 3), Amy Zhang (1), Scott Niekum (4). Affiliations: (1) The University of Texas at Austin, (2) University of Alberta, (3) Sony AI, (4) UMass Amherst
Pseudocode Yes Algorithm 1 RLZero. 1: Init: Pretrained Video Generation Model VM, Pretrained BFM π_z, Offline Exploration Dataset d^O. 2: Given: text prompt t. 3: Generate imagination video given the text prompt: {i_1, i_2, ..., i_l} = VM(t). 4: Project the imagined frames to real observations using embedding similarity as in Eq 3. 5: Use Theorem 1 for zero-shot inference to obtain BFM({s_1, s_2, ..., s_l}) = z_imit and return π_{z_imit}.
Open Source Code Yes Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We make the code available as supplementary material and will release it publicly upon acceptance of the manuscript.
Open Datasets Yes Setup: We consider continuous control tasks from the DeepMind Control Suite (Cheetah, Walker, Quadruped, Stickman) and HumEnv (3D Humanoid). ... We use continuous control environments from the DeepMind Control Suite [72]. For HumEnv [42], we use an open-source BFM [74] for policy generation given a reward function and do not assume access to the dataset used for training the BFM.
Dataset Splits No For the environments Cheetah, Walker, Quadruped, and Stickman, we collect a dataset d^O of environmental interactions of the form {s, a, s'} purely by a random exploration algorithm RND [10] (10 million transitions for Stickman and 5 million transitions for the rest of the environments). In the Stickman environment, we additionally augment this dataset with replay buffers of the agent trained for walking and running to increase the diversity of behaviors, increasing the dataset size by another 2 million transitions. The detailed composition of the datasets can be found in Appendix B.3.
Hardware Specification Yes Our experiments were performed on NVIDIA-A40 and AMD EPYC 7763 64-Core Processor machine.
Software Dependencies No Our implementation follows closely from the author's codebase. Specifically, FB trains Forward, Backward, and Actor networks. The hyperparameters for our FB implementation are listed below: Implementation: We build upon the codebase for FB https://github.com/facebookresearch/controllable_agent and implement all the algorithms under a uniform setup for network architectures and the same hyperparameters for shared modules across the algorithms.
Experiment Setup Yes The hyperparameters for our FB implementation are listed below: Table 3: Hyperparameters for zero-shot RL with FB. Hyperparameter: Value. Replay buffer size: 5 × 10^6, 10 × 10^6 (for stickman); Representation dimension: 128; Batch size: 1024; Discount factor γ: 0.98; Optimizer: Adam; Learning rate: 3 × 10^-4; Momentum coefficient for target networks: 0.99; Stddev σ for policy smoothing: 0.2; Truncation level for policy smoothing: 0.3; Number of gradient steps: 2 × 10^6; Regularization weight for orthonormality loss (ensures diversity): 1. FB specific hyperparameters: Hidden units (F): 1024; Number of layers (F): 3; Hidden units (B): 256; Number of layers (B): 2.
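
The control flow of Algorithm 1 (quoted in the Pseudocode row) can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function and parameter names (video_model, embed_frame, backward_fn, and so on) are hypothetical stand-ins. Cosine similarity is assumed for the "embedding similarity" of Eq 3, and the final step simply averages backward embeddings as a placeholder, because the content of Theorem 1 is not reproduced on this page.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def project_frames(frame_embeddings: Sequence[Vector],
                   dataset_embeddings: Sequence[Vector]) -> List[int]:
    """Step 4: map each imagined frame to the index of the most similar
    observation in the offline dataset (ASSUMPTION: cosine similarity)."""
    indices = []
    for frame in frame_embeddings:
        sims = [cosine_similarity(frame, obs) for obs in dataset_embeddings]
        indices.append(max(range(len(sims)), key=sims.__getitem__))
    return indices


def infer_latent(states: Sequence[Vector],
                 backward_fn: Callable[[Vector], Vector]) -> List[float]:
    """Step 5 placeholder. ASSUMPTION: average the backward embeddings of the
    projected states; the paper's Theorem 1 may prescribe something else."""
    embeddings = [backward_fn(s) for s in states]
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def rlzero_sketch(prompt: str,
                  video_model: Callable[[str], Sequence[object]],
                  embed_frame: Callable[[object], Vector],
                  dataset_states: Sequence[Vector],
                  dataset_embeddings: Sequence[Vector],
                  backward_fn: Callable[[Vector], Vector]) -> List[float]:
    """Steps 3-5: imagine a video, ground it in real states, infer a latent."""
    frames = video_model(prompt)                          # step 3
    frame_embeddings = [embed_frame(f) for f in frames]
    idx = project_frames(frame_embeddings, dataset_embeddings)  # step 4
    states = [dataset_states[i] for i in idx]
    return infer_latent(states, backward_fn)              # step 5


if __name__ == "__main__":
    # Toy stand-ins: "frames" are already 2-D embeddings, B is the identity.
    z = rlzero_sketch(
        prompt="walk forward",
        video_model=lambda t: [[1.0, 0.0], [0.0, 1.0]],
        embed_frame=lambda f: f,
        dataset_states=[[0.0, 2.0], [3.0, 0.1]],
        dataset_embeddings=[[0.0, 2.0], [3.0, 0.1]],
        backward_fn=lambda s: s,
    )
    print(z)
```

In an actual pipeline the returned latent would condition the pretrained BFM policy. The toy call above only exercises the data flow with hand-made vectors.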