Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Bi-Level Knowledge Transfer for Multi-Task Multi-Agent Reinforcement Learning
Authors: Junkai Zhang, Jinmin He, Yifan Zhang, Yifan Zang, Ning Xu, Jian Cheng
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We exploit different environments to conduct a large number of experiments, including StarCraft II Micromanagement (SMAC) [31] and Multi-Agent Particle Environment (MPE) [22]. We conduct five random seeds for each algorithm and evaluate them with 32 environments. 5.2 Performance Comparisons. Baselines: We compare BiKT against representative baselines. Table 1: The performance of different methods in Task set Marine-Hard. |
| Researcher Affiliation | Collaboration | Junkai Zhang (1, 2), Jinmin He (1, 2), Yifan Zhang (1, 2, 3), Yifan Zang (4), Ning Xu (1), Jian Cheng (1, 2, 5). (1) C2DL, Institute of Automation, Chinese Academy of Sciences; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) University of Chinese Academy of Sciences, Nanjing; (4) Beijing Institute of Astronautical Systems; (5) AiRiA |
| Pseudocode | Yes | A. Pseudocode of our method. The pseudocode of our method is detailed in Algorithm 1. Algorithm 1: BiKT for Multi-Task MARL |
| Open Source Code | Yes | Answer: [Yes] We provide code in the supplementary material. Justification: We provide code in the supplementary material. |
| Open Datasets | Yes | We exploit different environments to conduct a large number of experiments, including StarCraft II Micromanagement (SMAC) [31] and Multi-Agent Particle Environment (MPE) [22]. Following the experimental protocol proposed by [41], we leverage the task set and corresponding offline datasets they released: Marine-Hard, Marine-Easy, and Stalker Zealot. In our experiments, we use the same dataset collected by ODIS for the fair comparison. Our training uses offline datasets provided by [41]. |
| Dataset Splits | Yes | We denote $T_{\text{src}} = \{T^n_{\text{src}}\}_{n=1}^{N_{\text{src}}}$ and $T_{\text{tgt}} = \{T^n_{\text{tgt}}\}_{n=1}^{N_{\text{tgt}}}$ as the sets of source and unseen target tasks, respectively, where $N_{\text{src}}$ and $N_{\text{tgt}}$ represent the task numbers. Source tasks $\{T^n_{\text{src}}\}_{n=1}^{N_{\text{src}}}$ are associated with an offline dataset $D_{\text{src}} = \{D^n_{\text{src}}\}_{n=1}^{N_{\text{src}}}$ collected by a pre-trained policy. A trajectory from $D^n_{\text{src}}$ is defined as $(s^n_0, a^n_0, r^n_0, \ldots, s^n_H)$, where $H$ is the length of trajectory, $s^n_t$ and $a^n_t$ are the joint state and action at time $t$, and $r^n_t = R^n(s^n_t, a^n_t)$ is the reward. Under the multi-task generalization setting, our objective is to learn a general multi-agent policy $\pi$ that maximizes the expected discounted return across all tasks, shown in Eq. 1. After training, $\pi$ is evaluated on $T_{\text{tgt}}$ without further fine-tuning. |
| Hardware Specification | Yes | For computing resources, we utilize the Intel(R) Xeon(R) Gold 5220 CPU and NVIDIA TITAN RTX GPU in the experiments. Each experiment per task set lasts 8 hours on average. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | C. Implementation Details. In this section, we will provide the model structure, the hyperparameters, and other training details of ODIS. We present each part of BiKT in the following sections. Table 8: Hyperparameters of our method. Individual skill dimension $N_s$: 4; tactic embedding $N_c$: 64; number of tactics in $C$ ($K$): 16; hidden layer dimension of BDT: 64; multi-head number of BDT: 2; content length of BDT: 10; optimizer: Adam; training steps for $L_{\text{skill}}$: 15000; training steps for $L_{\text{tactic}}$: 8000; training steps for $L_{\text{policy}}$: 30000; batch size: 32; learning rate $l_1$: 0.0004; learning rate $l_2$: 0.0001; learning rate $l_3$: 0.0002; $\beta_1$: 1; $\beta_2$: 0.01; $\alpha$: 0.05 |
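
For readers attempting a reproduction, the Table 8 values quoted in the Experiment Setup row can be collected into a single configuration object. The following is a minimal sketch, not the authors' code: the class name `BiKTConfig` and every field name are hypothetical, and the excerpt does not say which training stage each of the learning rates l1, l2, and l3 belongs to, so they are kept under neutral names. The seed and evaluation counts come from the Research Type row.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BiKTConfig:
    """Hypothetical container for the reported Table 8 values.

    Field names are illustrative assumptions; only the numeric values
    come from the quoted excerpt.
    """
    skill_dim: int = 4               # individual skill dimension N_s
    tactic_embedding_dim: int = 64   # tactic embedding N_c
    num_tactics: int = 16            # number of tactics K in C
    bdt_hidden_dim: int = 64         # hidden layer dimension of BDT
    bdt_num_heads: int = 2           # multi-head number of BDT
    bdt_content_length: int = 10     # "content length of BDT" as reported
    optimizer: str = "Adam"
    skill_steps: int = 15_000        # training steps for L_skill
    tactic_steps: int = 8_000        # training steps for L_tactic
    policy_steps: int = 30_000       # training steps for L_policy
    batch_size: int = 32
    lr_1: float = 4e-4               # learning rate l1 (stage not stated)
    lr_2: float = 1e-4               # learning rate l2 (stage not stated)
    lr_3: float = 2e-4               # learning rate l3 (stage not stated)
    beta_1: float = 1.0
    beta_2: float = 0.01
    alpha: float = 0.05
    num_seeds: int = 5               # five random seeds per algorithm
    num_eval_envs: int = 32          # evaluated with 32 environments

    def total_training_steps(self) -> int:
        """Sum of the three reported per-loss training step counts."""
        return self.skill_steps + self.tactic_steps + self.policy_steps


config = BiKTConfig()
print(asdict(config))
print("total training steps:", config.total_training_steps())
```

Freezing the dataclass keeps the reported values from being changed by accident during a run; a variant can be created explicitly with `dataclasses.replace(config, batch_size=64)`.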