Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Text-to-Decision Agent: Offline Meta-Reinforcement Learning from Natural Language Supervision

Authors: Shilin Zhang, Zican Hu, Wenhao Wu, Xinyi Xie, Jianxiang Tang, Chunlin Chen, Daoyi Dong, Yu Cheng, Zhenhong Sun, Zhi Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We comprehensively evaluate and analyze our method on popular benchmarking domains across datasets of varying qualities, aiming to answer the following research questions: Can T2DA achieve consistent performance gain on zero-shot generalization capacity to unseen tasks? We compare it to various types of strong baselines, including offline meta-RL, in-context RL, and language-conditioned policy learning approaches. (Sec. 4.1) What is the contribution of each component to T2DA's performance? We ablate both the T2DA-D and T2DA-T architectures to analyze the respective impact of world model pre-training, contrastive language-decision pre-training, and language supervision. (Sec. 4.2) How robust is T2DA across diverse settings? We evaluate T2DA against baselines using offline datasets of varying qualities, and assess T2DA's performance when initializing the text encoder from different LLMs. (Sec. 4.3)
Researcher Affiliation | Academia | 1 Nanjing University 2 University of Technology Sydney 3 The Chinese University of Hong Kong 4 Australian National University 5 University of New South Wales
Pseudocode | Yes | Corresponding algorithm pseudocodes are given in Appendix A. ... Algorithm 1: Pre-training the generalized world model ... Algorithm 2: Contrastive Language-Decision Pre-training ... Algorithm 3: Model Training of Text-to-Decision Diffuser ... Algorithm 4: Model Training of Text-to-Decision Transformer ... Algorithm 5: Zero-Shot Evaluation of Text-to-Decision Diffuser ... Algorithm 6: Zero-Shot Evaluation of Text-to-Decision Transformer
Open Source Code | Yes | Our code is available at https://github.com/NJU-RL/T2DA.
Open Datasets | Yes | Comprehensive experiments on MuJoCo and Meta-World benchmarks show that T2DA facilitates high-capacity zero-shot generalization and outperforms various types of baselines. ... We evaluate T2DA on three benchmarks that are widely adopted to assess generalization capacities of RL algorithms: i) the 2D navigation Point-Robot; ii) the multi-task MuJoCo locomotion control, containing Cheetah-Vel and Ant-Dir; and iii) the Meta-World platform for robotic manipulation, where a robotic arm is designed to perform a wide range of manipulation tasks, such as close faucet, lock door, open door, and press button. ... D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
Dataset Splits | Yes | For each domain of Point-Robot, Cheetah-Vel, and Ant-Dir, we sample 50 tasks in total and split them into 45 training tasks and 5 test tasks. For Meta-World, we use 18 training tasks and 4 test tasks as detailed as shown in Table 4.
Hardware Specification | Yes | We train our models on one Nvidia RTX4080 GPU with the Intel Core i9-10900X CPU and 256G RAM.
Software Dependencies | No | The paper mentions several frameworks and models (e.g., CLIP [16], BERT [64], T5 [56], Decision Diffuser [58], Decision Transformer [59]) but does not provide specific version numbers for core software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | The paper provides detailed hyperparameters and configurations in tables. For instance, Table 5 lists 'Hyperparameters of SAC used to collect multi-task offline datasets' including training steps, learning rate, and discount factor. Table 6 details 'Configurations and hyperparameters in the training process of T2DA-T' with values for layers num, embedding dim, and learning rate. Table 7 similarly presents 'Configurations and hyperparameters in the training process of T2DA-D' including DiT layers num, diffusion steps, and learning rate.