Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

RLVR-World: Training World Models with Reinforcement Learning

Authors: Jialong Wu, Shaofeng Yin, Ningya Feng, Mingsheng Long

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate substantial performance gains on both language- and video-based world models across domains, including text games, web navigation, and robot manipulation. Our experiments show that RLVR can effectively fine-tune LLMs as language world models, yielding significant improvements... As shown in Table 2, the world model of the Internet can also be enhanced substantially by RLVR... As shown in Table 3, RLVR-World significantly improves the base model across all visual metrics on RT-1...
Researcher Affiliation | Academia | Jialong Wu (1), Shaofeng Yin (1, 2), Ningya Feng (1), Mingsheng Long (1). (1) School of Software, BNRist, Tsinghua University; (2) Zhili College, Tsinghua University.
Pseudocode | No | The paper describes the methodology in prose and mathematical equations (e.g., Eq. 1 and Eq. 2) and provides architectural details, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code, datasets, models, and video samples are available at the project website: https://thuml.github.io/RLVR-World.
Open Datasets | Yes | We use ByteSized32-State-Prediction [65], a dataset of text game state transitions... We further evaluate our approach on more realistic web navigation scenarios, using a web page state transition dataset collected by WMA [8] from the WebArena benchmark [80]... We primarily use the RT-1 robotic manipulation dataset [5] for our experiments... To compare with state-of-the-art models, we also include tabletop pushing (PushT) [9] and deformable object manipulation (Rope and Granular) [77] datasets from DINO-WM [79].
Dataset Splits | Yes | The dataset contains 76,369 transitions from 31 distinct text games, with 2954 high-quality transitions selected for testing... For training and testing, we select a 7K-sample subset... We allocate 99% of this subset for training and reserve the remaining 1% for testing... We use the RT-1 robotic manipulation dataset [5] for our experiments, which contains 87,212 tabletop teleoperation trajectories... 99% of trajectories are used as the training set and 1% are left as the test set.
Hardware Specification | Yes | The SFT phase uses 4 80G A100 GPUs for 6.5 hours of training. The RLVR phase is conducted on 8 80G A100 GPUs over 22.5 hours... SFT is performed on 8 40G A100 GPUs over 17 hours, while the RLVR training is conducted on 8 80G H100 GPUs for 25 hours... All experiments in this domain are conducted on a 40G A100 GPU cluster.
Software Dependencies | No | The paper mentions several frameworks and tools used, such as "verl framework [58]" and "accelerate", but it does not specify concrete version numbers for these software dependencies as required.
Experiment Setup | Yes | Experimental details can be found in Appendix A... Hyperparameters for training are provided in Table 5... Hyperparameters of architectures and training process for robot manipulation trajectory prediction are listed in Table 7 and 8.
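
For readers who want a rough sense of the compute behind the Hardware Specification row above, the quoted GPU counts and wall-clock times can be converted into GPU-hours. The following is a minimal sketch, not part of the paper or the scoring pipeline: the run labels and the `gpu_hours` helper are hypothetical names introduced here, and the excerpt does not state which domain each run belongs to.

```python
# Each entry: (illustrative label, number of GPUs, wall-clock hours),
# copied from the figures quoted in the Hardware Specification row.
# The labels are assumptions made for readability, not names from the paper.
runs = [
    ("SFT, 4x 80G A100", 4, 6.5),
    ("RLVR, 8x 80G A100", 8, 22.5),
    ("SFT, 8x 40G A100", 8, 17.0),
    ("RLVR, 8x 80G H100", 8, 25.0),
]


def gpu_hours(num_gpus: int, hours: float) -> float:
    """Return GPU-hours as the number of GPUs times wall-clock hours."""
    return num_gpus * hours


# Map each run label to its GPU-hour total.
totals = {label: gpu_hours(n, h) for label, n, h in runs}

for label, total in totals.items():
    print(f"{label}: {total:g} GPU-hours")
```

GPU-hours on different hardware (40G A100, 80G A100, 80G H100) are not directly comparable, so these figures are best read per run, not as one combined budget.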