Fast Peer Adaptation with Context-aware Exploration

Authors: Long Ma, Yuanfei Wang, Fangwei Zhong, Song-Chun Zhu, Yizhou Wang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on diverse testbeds that involve competitive (Kuhn Poker), cooperative (PO-Overcooked), or mixed (Predator-Prey-W) games with peer agents. We demonstrate that our method induces more active exploration behavior, achieving faster adaptation and better outcomes than existing methods.
Researcher Affiliation | Academia | (1) Academy for Advanced Interdisciplinary Studies, Peking University; (2) National Key Laboratory of General Artificial Intelligence, BIGAI & PKU; (3) Center on Frontiers of Computing Studies, School of Computer Science, Peking University; (4) School of Intelligence Science and Technology, Peking University; (5) Institute for Artificial Intelligence, Peking University; (6) National Engineering Research Center of Visual Technology, Peking University.
Pseudocode | Yes | Algorithm 1: Training Procedure of PACE (an illustrative sketch of its overall shape follows the table).
Open Source Code | Yes | Project page: https://sites.google.com/view/peer-adaptation
Open Datasets | No | The paper describes generating its own peer policies (e.g., 'we sample 40 P2 policies for training and 10 P2 policies for testing' for Kuhn Poker) and does not provide access information for a publicly available, pre-existing dataset.
Dataset Splits | No | The paper mentions 'training' and 'testing' pools of peer policies but does not describe a separate validation set or report validation splits (percentages or sample counts); see the split sketch after the table.
Hardware Specification | Yes | The training of PACE takes 12 hours with 80 processes on a single Titan Xp GPU.
Software Dependencies | No | 'For all baselines and ablations, we use PPO (Schulman et al., 2017; Kostrikov, 2018) as the RL training algorithm.' However, specific version numbers for software dependencies such as PyTorch, Python, or CUDA are not provided.
Experiment Setup | Yes | Tables 4, 5, and 6 list the hyperparameters related to architectures and PPO training for Kuhn Poker, PO-Overcooked, and Predator-Prey-W, respectively. These include Learning Rate, PPO Clip ϵ, Entropy Coefficient, γ, GAE λ, Batch Size, # Update Epochs, # Mini Batches, Gradient Clipping (L2), Activation Function, Actor/Critic Hidden Dims, fθ Hidden Dims, and gθ Hidden Dims (a generic PPO sketch follows the table).
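
To make the Dataset Splits row concrete, here is a minimal sketch of the train/test partition the paper describes, assuming a hypothetical `sample_peer_policy` as a stand-in for its peer-generation step (which is not specified here):

```python
# Hypothetical illustration only: the paper samples its own peer policies
# (no public dataset) and splits them into training and testing pools, e.g.
# 40/10 P2 policies for Kuhn Poker, with no separate validation pool reported.
import random

def sample_peer_policy():
    """Stand-in for the paper's peer-policy generation (not its actual code)."""
    return {"weights": [random.random() for _ in range(8)]}

random.seed(0)
peers = [sample_peer_policy() for _ in range(50)]
random.shuffle(peers)
train_peers, test_peers = peers[:40], peers[40:]  # 40 train / 10 test, no validation
```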
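
The review only quotes the name of Algorithm 1, so the following is a hedged sketch of the overall shape such a training procedure would take given the details above (peers sampled from the training pool, PPO updates); `collect_episode` and `ppo_update` are hypothetical stand-ins, not the authors' implementation:

```python
# Hedged sketch of an Algorithm 1-style loop: train the adapting agent with
# PPO against peers drawn from the fixed training pool. All names below are
# illustrative stand-ins for PACE's actual procedure.
import random

def collect_episode(agent, peer):
    """Stand-in: roll out one episode of agent vs. peer, return transitions."""
    return []

def ppo_update(agent, rollouts):
    """Stand-in: one clipped-PPO update over the collected rollouts."""
    pass

def train(agent, train_peers, iterations=1000, episodes_per_iter=80):
    # The paper reports 80 parallel processes; this loop serializes them.
    for _ in range(iterations):
        rollouts = [collect_episode(agent, random.choice(train_peers))
                    for _ in range(episodes_per_iter)]
        ppo_update(agent, rollouts)
    return agent
```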
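
The hyperparameters in Tables 4-6 parameterize two standard PPO ingredients, GAE advantages (γ, λ) and the clipped surrogate loss (clip ϵ, entropy coefficient). A generic PyTorch sketch follows; the default values shown are common choices, not the paper's settings:

```python
# Generic PPO ingredients per Schulman et al. (2017); hyperparameter values
# below are placeholders, not the paper's (see its Tables 4-6).
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finished episode."""
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0  # 0 at terminal
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def ppo_loss(new_logp, old_logp, advantages, entropy,
             clip_eps=0.2, entropy_coef=0.01):
    """Clipped surrogate objective with an entropy bonus."""
    ratio = torch.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    return policy_loss - entropy_coef * entropy.mean()
```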