Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DAIL: Beyond Task Ambiguity for Language-Conditioned Reinforcement Learning

Authors: Runpeng Xie, Quanwei Wang, Hao Hu, Zherui Zhou, Ni Mu, Xiyun Li, Yiqin Yang, Shuang Xu, Qianchuan Zhao, Bo Xu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments on both structured observation [10] and visual observation [54] benchmarks. This design is to validate the external validity of the DAIL agent, progressing from less complex structured inputs to more complex and expressive visual observations. The experimental results show that DAIL outperforms the state-of-the-art language-conditioned RL methods in both benchmarks. Further, the visualization analysis demonstrates that DAIL can learn a non-ambiguous task representation compared with baselines. Our main contributions are summarized as follows: First, we highlight the critical issue of task ambiguity and empirically analyze the limitations of current mainstream methods. We define the task distinction in our setting and analyze the sample complexity to avoid task ambiguity theoretically. Second, we propose DAIL, a simple yet efficient language-conditioned learning framework, which addresses the task ambiguity issue based on distributional policy and semantic alignment. Lastly, we conduct extensive experiments to show that DAIL significantly outperforms conventional language-conditioned methods. The results indicate that by improving task discrimination, we can effectively mitigate the task ambiguity issue, thereby broadening the application of language-conditioned RL.
Researcher Affiliation	Collaboration	Runpeng Xie¹, Quanwei Wang², Hao Hu³, Zherui Zhou⁴, Ni Mu², Xiyun Li⁵, Yiqin Yang¹, Shuang Xu¹, Qianchuan Zhao², Bo Xu¹. ¹The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; ²Department of Automation, Tsinghua University; ³Moonshot AI; ⁴Department of Computer Science and Engineering, Washington University; ⁵Tencent AI Lab
Pseudocode	Yes	We show the overall framework of DAIL in Figure 1 and Algorithm 1 in Appendix B. [...] Appendix B (Algorithm): Algorithm 1, Distributional Aligned Learning.
Open Source Code	Yes	Our implementation is available at https://github.com/RunpengXie/Distributional-Aligned-Learning.
Open Datasets	Yes	We conduct extensive experiments on both structured observation [10] and visual observation [54] benchmarks. BabyAI [10] is a language learning research platform with different levels of tasks, shown on the Left of Figure 4. We choose level SynthLoc for evaluation... ALFRED [54] benchmarks sequential decision-making tasks involving household activities (e.g., cleaning, heating food) through language instructions and first-person vision, shown on the Right of Figure 4.
Dataset Splits	Yes	We divide the task set into two subsets, designating approximately 60% of the tasks as in-distribution tasks. All trajectories in the offline dataset are collected under in-distribution instructions, while tasks encountered during testing outside this set are considered out-of-distribution tasks. To construct the offline dataset, we collect three types of data: expert data, gathered by a pre-designed bot within the environment; medium data, collected by a well-trained agent; and random data. The built-in bot has access to global information to accomplish every possible task with a near-optimal solution. We train an IL agent following BabyAI 1.1 [23], the state-of-the-art model proposed by the original environmental authors. Trained on a dataset of 100k expert trajectories, it achieved approximately 87.9% success rate across all tasks. The random agent achieves a 10.5% success rate during data collection. We construct a high-quality dataset with 50k expert trajectories, 50k IL agent trajectories, and 25k random trajectories; a medium-quality dataset with 12.5k expert trajectories, 25k IL agent trajectories, and 40k random trajectories. All the trajectories in the dataset are generated under in-distribution instructions. To simulate the presence of noisy data in real-world applications, we augment the training set with 30k random-agent trajectories, resulting in 97896 total trajectories with 53442 unique instructions across 108 household scenes.
Hardware Specification	No	The information on computer resources is provided in Appendix F. For ALFRED, we use the original encoding framework in ALFRED [54] for all implemented methods, which contains two sequential blocks: each block contains a 1×1 convolutional layer, followed by batch normalization and ReLU activation. The features are then flattened and projected to 512 dimensions through a linear layer, producing a 512-dimensional vector as the observation encoder's final output.
Software Dependencies	No	All models are implemented with PyTorch, and trained with a batch size of 64, using the Adam optimizer [27] at a learning rate of 3e-4. All layers in the networks utilize PyTorch's default weight initialization, and the network outputs fixed-dimensional embeddings suitable for downstream tasks.
Experiment Setup	Yes	All models are implemented with PyTorch, and trained with a batch size of 64, using the Adam optimizer [27] at a learning rate of 3e-4. All layers in the networks utilize PyTorch's default weight initialization, and the network outputs fixed-dimensional embeddings suitable for downstream tasks. In BabyAI experiments, all methods were trained for 50 epochs over 3 seeds. And in the ALFRED experiments, all methods were trained for 20 epochs over 3 seeds following [54]. As for DAIL, we fix α = 2 and λ = 0.2 except for the toy experiment and ablation experiment of λ. We use V_MAX = -V_MIN = 20, M = 51 in all our experiments following [29].
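
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may help when checking a reimplementation against the reported values. The following is a minimal sketch, not the authors' code: the names `TrainConfig`, `value_support`, and `expected_value` are hypothetical, and it assumes a symmetric value range (V_MIN = -20, V_MAX = 20), the only reading under which M = 51 atoms span a non-degenerate interval. The evenly spaced atom layout is the standard categorical distributional RL construction and is also an assumption about DAIL's implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the Experiment Setup row.
    batch_size: int = 64
    learning_rate: float = 3e-4   # used with the Adam optimizer
    alpha: float = 2.0            # DAIL hyperparameter alpha
    lam: float = 0.2              # DAIL hyperparameter lambda
    num_atoms: int = 51           # M
    v_max: float = 20.0
    v_min: float = -20.0          # ASSUMPTION: symmetric support around zero
    babyai_epochs: int = 50
    alfred_epochs: int = 20
    num_seeds: int = 3


def value_support(cfg: TrainConfig) -> List[float]:
    """Return evenly spaced atoms z_i = v_min + i * delta for i = 0..M-1."""
    delta = (cfg.v_max - cfg.v_min) / (cfg.num_atoms - 1)
    return [cfg.v_min + i * delta for i in range(cfg.num_atoms)]


def expected_value(probs: Sequence[float], support: Sequence[float]) -> float:
    """Collapse a categorical value distribution to its mean."""
    return sum(p * z for p, z in zip(probs, support))


cfg = TrainConfig()
support = value_support(cfg)
uniform = [1.0 / cfg.num_atoms] * cfg.num_atoms
mean_under_uniform = expected_value(uniform, support)
```

With these assumed bounds the atom spacing is (20 - (-20)) / 50 = 0.8, and a uniform distribution over the atoms has a mean of approximately zero.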