Fast Peer Adaptation with Context-aware Exploration
Authors: Long Ma, Yuanfei Wang, Fangwei Zhong, Song-Chun Zhu, Yizhou Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on diverse testbeds that involve competitive (Kuhn Poker), cooperative (PO-Overcooked), or mixed (Predator-Prey-W) games with peer agents. We demonstrate that our method induces more active exploration behavior, achieving faster adaptation and better outcomes than existing methods. |
| Researcher Affiliation | Academia | 1Academy for Advanced Interdisciplinary Studies, Peking University 2Nat'l Key Laboratory of General Artificial Intelligence, BIGAI & PKU 3Center on Frontiers of Computing Studies, School of Computer Science, Peking University 4School of Intelligence Science and Technology, Peking University 5Inst. for Artificial Intelligence, Peking University 6Nat'l Eng. Research Center of Visual Technology, Peking University. |
| Pseudocode | Yes | Algorithm 1 Training Procedure of PACE |
| Open Source Code | Yes | Project page: https://sites.google.com/view/peer-adaptation |
| Open Datasets | No | The paper describes generating its own peer policies (e.g., 'we sample 40 P2 policies for training and 10 P2 policies for testing' for Kuhn Poker) and does not provide access information for a publicly available, pre-existing dataset. |
| Dataset Splits | No | The paper mentions 'training' and 'testing' pools of peer policies but does not explicitly describe a separate 'validation' set or specific splits for validation, such as percentages or sample counts. |
| Hardware Specification | Yes | The training of PACE takes 12 hours with 80 processes on a single Titan Xp GPU. |
| Software Dependencies | No | For all baselines and ablations, we use PPO (Schulman et al., 2017; Kostrikov, 2018) as the RL training algorithm. However, specific version numbers for software dependencies like PyTorch, Python, or CUDA are not provided. |
| Experiment Setup | Yes | Tables 4, 5, and 6 list the hyperparameters for the architectures and PPO training on Kuhn Poker, PO-Overcooked, and Predator-Prey-W, respectively. These include Learning Rate, PPO Clip ϵ, Entropy Coefficient, γ, GAE λ, Batch Size, # Update Epochs, # Mini Batches, Gradient Clipping (L2), Activation Function, Actor/Critic Hidden Dims, fθ Hidden Dims, and gθ Hidden Dims. |
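To make the Experiment Setup row concrete, the sketch below groups the hyperparameter categories listed in Tables 4–6 into a single config object. All field values are illustrative placeholders (common PPO defaults), not the paper's reported settings, and the field names are assumptions chosen to mirror the table headings.

```python
from dataclasses import dataclass, field

@dataclass
class PPOConfig:
    """Hyperparameter groups matching the paper's Tables 4-6.
    Every value here is a placeholder, NOT taken from the paper."""
    learning_rate: float = 3e-4          # Learning Rate
    clip_eps: float = 0.2                # PPO Clip epsilon
    entropy_coef: float = 0.01           # Entropy Coefficient
    gamma: float = 0.99                  # discount factor gamma
    gae_lambda: float = 0.95             # GAE lambda
    batch_size: int = 2048               # Batch Size
    update_epochs: int = 4               # "# Update Epochs"
    num_mini_batches: int = 4            # "# Mini Batches"
    grad_clip_l2: float = 0.5            # Gradient Clipping (L2)
    activation: str = "tanh"             # Activation Function
    actor_critic_hidden: tuple = (64, 64)   # Actor/Critic Hidden Dims
    f_theta_hidden: tuple = (128,)       # f_theta (context encoder) Hidden Dims
    g_theta_hidden: tuple = (128,)       # g_theta Hidden Dims

# One config per testbed, as the paper tabulates them separately.
configs = {
    name: PPOConfig() for name in
    ("Kuhn Poker", "PO-Overcooked", "Predator-Prey-W")
}
```

A per-environment dict like `configs` makes it easy to override only the entries that differ between testbeds while keeping shared PPO defaults in one place.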