Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds

Authors: Fan Wang, Pengtao Shao, Yiming Zhang, Bo Yu, Shaoshan Liu, Ning Ding, Yang Cao, Yu Kang, Haifeng Wang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set through versatile in-context learning paradigms. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and prioritizing asymptotic performance over few-shot adaptation.
Researcher Affiliation Collaboration (1) University of Science and Technology of China, Hefei, China; (2) Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China; (3) Anhui Province Key Laboratory of Intelligent Low-Carbon Information Technology and Equipment; (4) Baidu Inc, Beijing, China
Pseudocode Yes Algorithm 1 elaborates on the detailed procedural generation of AnyMDP tasks. Algorithm 2: Data Synthesis Pipeline. Algorithm 3: Meta-Training Process. Algorithm 4: Evaluation Process.
Open Source Code Yes https://github.com/FutureAGI/Xenoverse/tree/main/xenoverse/anymdp and https://github.com/airs-cuhk/airsoul/tree/main/projects/OmniRL
Open Datasets Yes To this end, we collect a dataset $D_{\text{tra}}(T(n_s, n_a))$ comprising 512K sequences for training, where $n_s \in [16, 128]$, $n_a = 5$. The length of each sequence $T$ is 12K, resulting in a total of 6B time steps. For testing, we independently sample tasks $T_{\text{tst}}$ with $n_s \in \{1, 16, 32, 64, 128\}$, ensuring each $n_s$ group contains 256 tasks. We also evaluate our model, namely OmniRL, on both unseen AnyMDP tasks in Figure 5, Gymnasium tasks, and Dark Room [13] in Figure 16, and those performances are shown in Table 1.
Dataset Splits Yes We first validate the representational capability of AnyMDP tasks as universal MDPs. To this end, we collect a dataset $D_{\text{tra}}(T(n_s, n_a))$ comprising 512K sequences for training, where $n_s \in [16, 128]$, $n_a = 5$. The length of each sequence $T$ is 12K, resulting in a total of 6B time steps. For testing, we independently sample tasks $T_{\text{tst}}$ with $n_s \in \{1, 16, 32, 64, 128\}$, ensuring each $n_s$ group contains 256 tasks.
Hardware Specification Yes The meta-training process is primarily conducted using 8 Nvidia Tesla A800 GPUs. We use a batch size of 5 per GPU, divided into segments (chunks) of 2K steps each. We optimize using the AdamW algorithm with a learning rate that decays from a peak value of $2 \times 10^{-4}$. The average time cost per iteration is 8 seconds for trajectories with $T$ = 12K, and this cost increases linearly with sequence length. For more details please check Appendix C.2. For the causal sequence model, we evaluate four architectures: RWKV-7 [70], Gated Delta-Net (GDN) [72], Gated Self-Attention (GSA) [67], Mamba2 [73]. The test results are largely consistent with the conclusions reported in language processing (RWKV-7 GDN > GSA > Mamba2, with details in Appendix D.1), demonstrating the capability of AnyMDP to serve as a benchmark for long-term sequence modeling. Therefore, we select RWKV-7 for subsequent experiments.
Software Dependencies No The paper mentions several models and algorithms like 'AdamW algorithm', 'RWKV-7', 'Gated Delta-Net (GDN)', 'Gated Self-Attention (GSA)', 'Mamba2', and 'flash-linear-attention'. However, it does not provide specific version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup Yes The meta-training process is primarily conducted using 8 Nvidia Tesla A800 GPUs. We use a batch size of 5 per GPU, divided into segments (chunks) of 2K steps each. We optimize using the AdamW algorithm with a learning rate that decays from a peak value of $2 \times 10^{-4}$. The average time cost per iteration is 8 seconds for trajectories with $T$ = 12K, and this cost increases linearly with sequence length.
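
The dataset and training figures quoted in the table can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper or its repositories. It assumes that "K" means 1,000, that the global batch is the per-GPU batch multiplied by the number of GPUs, and that one iteration consumes one global batch of full-length trajectories. The class name `TrainingSetup` and its fields are hypothetical, and the time-per-pass value is a rough extrapolation rather than a reported result.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Figures quoted in the report; 'K' is read as 1,000 (an assumption)."""
    num_sequences: int = 512_000    # training sequences in the dataset
    seq_len: int = 12_000           # time steps per sequence (T)
    num_gpus: int = 8               # A800 GPUs used for meta-training
    batch_per_gpu: int = 5          # sequences per GPU per iteration
    chunk_len: int = 2_000          # steps per segment (chunk)
    seconds_per_iter: float = 8.0   # average cost per iteration at T = 12K

    @property
    def total_steps(self) -> int:
        # Total time steps in the training set.
        return self.num_sequences * self.seq_len

    @property
    def global_batch(self) -> int:
        # Assumed: sequences processed per iteration across all GPUs.
        return self.num_gpus * self.batch_per_gpu

    @property
    def chunks_per_sequence(self) -> int:
        # Number of 2K-step segments each trajectory is split into.
        return self.seq_len // self.chunk_len

    @property
    def iters_per_pass(self) -> int:
        # Iterations needed to see every training sequence once.
        return math.ceil(self.num_sequences / self.global_batch)

    @property
    def hours_per_pass(self) -> float:
        # Rough wall-clock estimate for one pass (extrapolated, not reported).
        return self.iters_per_pass * self.seconds_per_iter / 3600


if __name__ == "__main__":
    setup = TrainingSetup()
    print(f"total steps:         {setup.total_steps:,}")
    print(f"global batch:        {setup.global_batch}")
    print(f"chunks per sequence: {setup.chunks_per_sequence}")
    print(f"iterations per pass: {setup.iters_per_pass:,}")
    print(f"hours per pass:      {setup.hours_per_pass:.1f}")
```

Under these assumptions the product of 512K sequences and 12K steps is about 6.1 billion, which agrees with the "6B time steps" quoted above, and each 12K-step trajectory splits into six 2K-step chunks.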