Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Mixture-of-Experts Meets In-Context Reinforcement Learning

Authors: Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We comprehensively evaluate our methods on popular benchmarking domains using various qualities of offline datasets.
Researcher Affiliation Academia Nanjing University; University of Technology Sydney
Pseudocode Yes Appendix A (Algorithm Pseudocodes): Based on the implementations in Sec. 4, this section gives the brief procedures of T2MIR. Algorithm 1 and Algorithm 3 show the pipeline for training T2MIR-AD and T2MIR-DPT, respectively. We train all components, including the token-wise MoE, the task-wise MoE, and their routers, together with the main causal transformer network end to end. Then, Algorithm 2 and Algorithm 4 show the evaluation phase, where the agent can improve its performance on test tasks by interacting with the environments without any parameter updates.
Open Source Code Yes Our code is available at https://github.com/NJU-RL/T2MIR.
Open Datasets No Appendix D presents more details about environments and dataset construction. We construct three datasets with different qualities: Mixed, Medium-Expert, and Medium. For discrete environment Dark Room, we use the expert policy to collect datasets by progressively reducing the noise, as in [14]. For Point-Robot and MuJoCo environments, we employ the soft actor-critic (SAC) [62] algorithm to train a policy independently for each task... For Meta-World environments, we use the Proximal Policy Optimization (PPO) [63] algorithm implementation provided by Stable Baselines3 [64]... During training, we periodically save the policy checkpoints and use them to generate various qualities of offline datasets as Mixed, Medium-Expert, and Medium.
Dataset Splits Yes We evaluate T2MIR on four benchmarks that are widely used in multi-task settings: i) the discrete environment Dark Room...We construct three datasets with different qualities: Mixed, Medium-Expert, and Medium. Appendix D presents more details about environments and dataset construction. Dark Room: ...we randomly sample 80 tasks as training tasks and hold out the remaining 20 for evaluation. Point-Robot: ...We randomly sample 45 goals as training tasks, and another 5 goals for evaluation.
Hardware Specification Yes Compute. We train our models on one Nvidia RTX4080 GPU with the Intel Core i9-10900X CPU and 256G RAM. The training process takes about 0.5-3 hours, depending on the complexity of the environments.
Software Dependencies No For Meta-World environments, we use the Proximal Policy Optimization (PPO) [63] algorithm implementation provided by Stable Baselines3 [64]... Akin to AMAGO [45], we adopt FlashAttention [56] to enable long context lengths on a single GPU... Each block employs a multi-head self-attention module followed by a feedforward network or a MoE layer with GELU activation [66].
Experiment Setup Yes Table 6 and Table 7 show the detailed hyperparameters used for T2MIR-AD and T2MIR-DPT using Mixed datasets, respectively. Appendix G presents comprehensive hyperparameter analysis, including expert selection in token-wise and task-wise MoE, InfoNCE loss ratio, and the positioning of MoE layers.
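
The task-level splits quoted in the Dataset Splits row (80 training and 20 held-out tasks for Dark Room; 45 training and 5 evaluation goals for Point-Robot) can be sketched as follows. This is a minimal illustration and not the authors' code: the helper name `split_tasks`, the seed, and the assumption that the Point-Robot training and evaluation goals are drawn from a single pool of 50 are all hypothetical.

```python
import random


def split_tasks(num_tasks, num_train, seed=0):
    """Randomly partition task indices into disjoint train and held-out sets.

    Hypothetical helper for illustration; the seed is an arbitrary choice.
    """
    rng = random.Random(seed)
    task_ids = list(range(num_tasks))
    rng.shuffle(task_ids)
    train_ids = sorted(task_ids[:num_train])
    test_ids = sorted(task_ids[num_train:])
    return train_ids, test_ids


# Dark Room: 80 training tasks, the remaining 20 held out for evaluation.
dark_room_train, dark_room_test = split_tasks(num_tasks=100, num_train=80)

# Point-Robot: 45 training goals and 5 evaluation goals
# (assumes, for illustration only, a shared pool of 50 goals).
point_robot_train, point_robot_test = split_tasks(num_tasks=50, num_train=45)

print(len(dark_room_train), len(dark_room_test))
print(len(point_robot_train), len(point_robot_test))
```

Keeping the two sets disjoint at the task level is what makes the held-out tasks a test of in-context adaptation and not of memorization.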