Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Ad Hoc Teamwork via Offline Goal-Based Decision Transformers

Authors: Xinzhi Zhang, Hohei Chan, Deheng Ye, Yi Cai, Mengchen Zhao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results show that TAGET significantly outperforms existing solutions to AHT in the offline setting. (...) 5. Experiments
Researcher Affiliation | Collaboration | ¹School of Software Engineering, South China University of Technology, Guangzhou, China; ²Tencent, Shenzhen, China. Correspondence to: Mengchen Zhao <EMAIL>.
Pseudocode | Yes | D. Pseudocode of Algorithm: Algorithm 1 demonstrates our trajectory mirroring strategy for pre-processing the offline dataset. Algorithm 2 demonstrates the offline training process of TAGET. Algorithm 3 illustrates the online testing process of TAGET.
Open Source Code | No | The paper does not provide explicit statements or links regarding the availability of open-source code for the described methodology.
Open Datasets | No | To train our model in an offline setting, we utilize precollected interaction trajectories. To ensure the model's adaptability to diverse teammate strategies, we adopt the Soft-Value Diversity (SVD) method proposed in CSP (Ding et al., 2023) to collect data.
Dataset Splits | Yes | We trained four distinct populations of multi-agent reinforcement learning (MARL) policies for each environment. From these, one population was randomly sampled as the testing teammate set, while the remaining three were used to collect interaction trajectories for the offline dataset.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not provide specific software dependencies (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | Our model adopts the Decision Transformer (DT) backbone, with the following configurations: an embedding dimension of 64, context window length K = 30, 2 transformer layers with 1 attention head each, ReLU activation, and a dropout rate of 0.3. The network is optimized using AdamW with a learning rate of 0.01, a batch size of 2048, and a weight decay of 0.0001. Several task-specific coefficients balance the different learning objectives in our training loss; specifically, we set the weighting parameters as follows: α = 0.0001, β = 100, γ = 100, and σ = 0.001.
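The Dataset Splits row describes sampling one of four trained MARL populations as the held-out testing teammate set and using the remaining three for offline data collection. A minimal sketch of that split, assuming nothing beyond the quoted description (the function name and population labels are hypothetical; the paper does not give the actual sampling code):

```python
import random

def split_populations(populations, seed=0):
    """Hold out one randomly chosen population as the testing teammate set;
    the remaining populations supply trajectories for the offline dataset.
    Illustrative sketch only, per the paper's description."""
    rng = random.Random(seed)
    pops = list(populations)
    test_pop = pops.pop(rng.randrange(len(pops)))  # 1 of 4 held out
    return test_pop, pops  # (testing set, 3 training populations)

# Hypothetical population labels for the four trained MARL policy populations
test_set, train_sets = split_populations(["pop1", "pop2", "pop3", "pop4"])
```

The held-out population never contributes trajectories to the offline dataset, which is what makes the evaluation teammates unseen at training time.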
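For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. The dictionary keys below are illustrative names, not a schema from the paper; the values are exactly those reported:

```python
# TAGET Decision Transformer hyperparameters as reported in the paper.
# Key names are hypothetical; only the values come from the text.
TAGET_CONFIG = {
    "embed_dim": 64,          # embedding dimension
    "context_length": 30,     # context window length K
    "n_layers": 2,            # transformer layers
    "n_heads": 1,             # attention heads per layer
    "activation": "relu",
    "dropout": 0.3,
    "optimizer": "AdamW",
    "learning_rate": 0.01,
    "batch_size": 2048,
    "weight_decay": 0.0001,
    # Coefficients balancing the learning objectives in the training loss
    "alpha": 0.0001,
    "beta": 100,
    "gamma": 100,
    "sigma": 0.001,
}
```

Having every reported value in one place makes it easy to spot what is still missing for replication, e.g. library versions and hardware, which the report marks as unspecified.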