Peer Learning: Learning Complex Policies in Groups from Scratch via Action Recommendations

Authors: Cedric Derstroff, Mattia Cerrato, Jannis Brugger, Jan Peters, Stefan Kramer

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Eventually, we analyze the learning behavior of the peers and observe their ability to rank the agents' performance within the study group and understand which agents give reliable advice. Further, we compare peer learning with single-agent learning and a state-of-the-art action advice baseline. We show that peer learning is able to outperform single-agent learning and the baseline in several challenging discrete and continuous OpenAI Gym domains.
Researcher Affiliation | Academia | 1 Technische Universität Darmstadt, 2 Hessian Center for Artificial Intelligence (hessian.AI), 3 Johannes Gutenberg-Universität Mainz, 4 German Research Center for AI (DFKI), 5 Centre for Cognitive Science; {cedric.derstroff, jannis.brugger, jan.peters}@tu-darmstadt.de, mcerrato@uni-mainz.de, kramer@informatik.uni-mainz.de
Pseudocode | No | The paper contains mathematical equations (e.g., Eqs. 2-5 and 8) but no structured pseudocode or algorithm blocks with clear labels like "Algorithm" or "Pseudocode". (A hedged reconstruction of the training loop is sketched below the table.)
Open Source Code | Yes | Our Python code can be found on GitHub and works with several off-policy RL algorithms that make use of a Q-function, especially but not limited to actor-critic methods.
Open Datasets | Yes | Being among the small group of approaches that work for continuous action spaces, we used several MuJoCo (Todorov, Erez, and Tassa 2012) OpenAI Gym environments (Brockman et al. 2016), i.e., HalfCheetah-v4, Walker2d-v4, Ant-v4 and Hopper-v4. (An instantiation check for these environments is sketched below the table.)
Dataset Splits | No | The paper refers to using OpenAI Gym environments but does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, or testing.
Hardware Specification | No | The paper mentions running experiments on "supercomputer MOGON 2" but does not provide specific hardware details such as exact GPU/CPU models, processor types with speeds, or memory amounts.
Software Dependencies | No | The paper mentions using Python code, the Soft Actor-Critic (SAC) algorithm, and Deep Q-Networks (DQN) but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | In this setting, we use a group size of 4 peers. For all experiments, we used 10 random seeds, except for the Room-v2, where we used 15. [...] where α is a trust learning rate separate from the base learning algorithm's learning rate. In our experiments, we keep this fixed at 0.99. For τ0 = 1 and λ = 0, the Boltzmann distribution... (One plausible reading of the trust update and temperature schedule is sketched below the table.)
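
Since the paper provides equations but no pseudocode, the following is a minimal sketch of how the peer-learning loop might fit together, reconstructed only from the excerpts above: each peer recommends an action for the current state, one recommendation is followed (Boltzmann-weighted by learned trust), the executed transition feeds a standard off-policy learner, and trust in the adviser is adjusted. Every interface name here (`peer.act`, `learner.store`, `learner.update`) is a hypothetical placeholder, not the authors' API.

```python
import numpy as np

def run_peer_episode(env, learner, peers, trust, rng, alpha=0.99):
    """One episode of peer learning (hedged reconstruction, not the
    authors' code). `peers` recommend actions via a hypothetical
    .act(obs); `learner` is any off-policy agent with a Q-function
    (e.g., SAC or DQN) exposing hypothetical .store/.update hooks;
    `trust` is a numpy array with one scalar per peer."""
    obs, _ = env.reset()
    done = False
    while not done:
        # 1. Collect one action recommendation per peer.
        recommendations = [peer.act(obs) for peer in peers]

        # 2. Pick whose advice to follow: Boltzmann over trust values
        #    (temperature 1, matching the quoted tau0 = 1, lambda = 0).
        logits = trust - trust.max()            # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        adviser = rng.choice(len(peers), p=probs)
        action = recommendations[adviser]

        # 3. Execute the chosen recommendation in the environment.
        obs_next, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # 4. Standard off-policy update from the executed transition.
        learner.store(obs, action, reward, obs_next, done)
        learner.update()

        # 5. Adjust trust in the adviser; one plausible rule is an
        #    exponential moving average with the quoted alpha = 0.99.
        trust[adviser] = alpha * trust[adviser] + (1 - alpha) * reward

        obs = obs_next
    return trust
```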
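
The four continuous-control environments named under "Open Datasets" are standard Gym IDs and can be instantiated directly; a minimal sanity check, assuming gym >= 0.26 with the MuJoCo bindings installed:

```python
import gym

# The four MuJoCo tasks named in the paper.
for env_id in ["HalfCheetah-v4", "Walker2d-v4", "Ant-v4", "Hopper-v4"]:
    env = gym.make(env_id)
    obs, info = env.reset(seed=0)
    print(env_id, "obs:", env.observation_space.shape,
          "act:", env.action_space.shape)
    env.close()
```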
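
The "Experiment Setup" excerpt names a trust learning rate α = 0.99 and Boltzmann parameters τ0 and λ but does not reproduce the full formulas. One consistent reading, assuming a temperature schedule τ_t = τ0 · exp(−λ·t) (so that the quoted τ0 = 1, λ = 0 keeps the temperature constant at 1 and reduces to the plain softmax used in the episode sketch above), is:

```python
import numpy as np

def advice_probabilities(trust, t, tau0=1.0, lam=0.0):
    """Boltzmann distribution over peers' trust values. Assumes the
    temperature schedule tau_t = tau0 * exp(-lam * t); this schedule
    is an assumption, not taken from the paper."""
    tau = tau0 * np.exp(-lam * t)
    logits = np.asarray(trust, dtype=float) / tau
    logits -= logits.max()                      # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Example: three peers, one clearly more trusted than the others.
print(advice_probabilities([0.9, 0.1, 0.2], t=0))
```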