Pareto Policy Adaptation

Authors: Panagiotis Kyriakis, Jyotirmoy Deshmukh, Paul Bogdan

ICLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our method in a series of reinforcement learning tasks. (Section 5, Experimental Evaluation) In this section, we evaluate the performance of our proposed method.
Researcher Affiliation Academia Panagiotis Kyriakis, University of Southern California, Los Angeles, USA (pkyriaki@usc.edu); Jyotirmoy V. Deshmukh, University of Southern California, Los Angeles, USA (jdeshmuk@usc.edu); Paul Bogdan, University of Southern California, Los Angeles, USA (pbogdan@usc.edu)
Pseudocode Yes Algorithm 1: Multi-Objective Policy Gradient
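Algorithm 1 itself is not reproduced here. As a rough illustration of the kind of update a multi-objective policy gradient performs, the minimal PyTorch sketch below scalarizes per-objective advantage estimates with a preference weight vector before a REINFORCE-style step. The function name, tensor shapes, and the linear-scalarization choice are assumptions made for illustration, not the paper's exact Pareto-based procedure.

```python
import torch

def multi_objective_pg_loss(log_probs, advantages, weights):
    """Scalarized multi-objective policy-gradient surrogate loss.

    log_probs:  (T,)   log pi(a_t | s_t) for the sampled actions
    advantages: (T, K) per-objective advantage estimates (e.g. from GAE)
    weights:    (K,)   non-negative preference weights over the K objectives
    """
    # Combine the K per-objective advantages into one scalar signal per step.
    scalar_adv = advantages @ weights          # shape (T,)
    # REINFORCE-style surrogate: maximize the weighted advantage.
    return -(log_probs * scalar_adv).mean()

# Toy usage with random data: 8 timesteps, 2 objectives.
T, K = 8, 2
log_probs = torch.randn(T, requires_grad=True)
advantages = torch.randn(T, K)
weights = torch.tensor([0.5, 0.5])
loss = multi_objective_pg_loss(log_probs, advantages, weights)
loss.backward()
print(loss.item())
```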
Open Source Code No The paper does not contain an explicit statement about releasing its own source code or provide a link to a code repository for the methodology described.
Open Datasets Yes Domains: We evaluate on 4 environments (details given in the Appendix): (a) Multi-Objective Grid World (MOWG): a variant of the classical gridworld, (b) Deep Sea Treasure (DST): a slightly modified version of the classical multi-objective reinforcement learning environment (51), (c) Multi-Objective Super Mario (MOSM): a modified, multi-objective variant of the popular video game that has a 5-dimensional reward signal, and (d) Multi-Objective MuJoCo (MOMU): a modified version of the MuJoCo physics simulator, focusing on locomotion tasks. Our implementation uses modified versions of 4 OpenAI Gym environments (Fig. 5).
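As a hypothetical illustration of how a standard OpenAI Gym environment can be modified to emit a vector-valued reward, as the paper's four environments do, the sketch below wraps a scalar-reward task and returns a 2-dimensional reward. The wrapper name, the reward decomposition, the Pendulum environment id, and the classic 4-tuple step API are all assumptions for illustration, not the paper's actual environment code.

```python
import numpy as np
import gym

class VectorRewardWrapper(gym.Wrapper):
    """Hypothetical wrapper that turns a scalar-reward Gym task into a
    multi-objective one by returning a 2-dimensional reward vector:
    [original task reward, negative control cost]."""

    def step(self, action):
        # Assumes the classic 4-tuple step API (gym < 0.26).
        obs, reward, done, info = self.env.step(action)
        # Second objective: penalize large actions (control cost).
        ctrl_cost = -float(np.sum(np.square(action)))
        return obs, np.array([reward, ctrl_cost], dtype=np.float32), done, info

# Environment id depends on the installed gym version; "Pendulum-v1" is
# used here purely as an example of a continuous-control task.
env = VectorRewardWrapper(gym.make("Pendulum-v1"))
obs = env.reset()
obs, reward_vec, done, info = env.step(env.action_space.sample())
print(reward_vec)  # e.g. array([task_reward, -control_cost])
```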
Dataset Splits No The paper does not explicitly specify exact percentages or sample counts for training, validation, and test dataset splits for reproducibility. It describes environments and training procedures, but not formal data splits.
Hardware Specification Yes We run all of our simulations in the Google Cloud Platform using 48 vCores and one NVIDIA Tesla T4 GPU.
Software Dependencies No The paper mentions “PyTorch”, the “torch-ac” package, and “OpenAI Gym”, but does not specify their version numbers or any other software dependencies with version numbers.
Experiment Setup Yes We set the GAE parameter to λ = 0.95 and the discount factor to γ = 0.99. We use the Adam optimizer (β1 = 0.9, β2 = 0.999) with a learning rate of 0.0001. Every 512 frames, we perform one update of the network parameters, iterating over 10 epochs of the collected data and using a mini-batch size of 64.
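For reference, the quoted hyperparameters could be wired up in PyTorch roughly as follows. Only the numeric values come from the paper; the placeholder network and variable names are illustrative assumptions.

```python
import torch

# Hyperparameters quoted in the paper's experiment setup.
GAE_LAMBDA        = 0.95
GAMMA             = 0.99
LEARNING_RATE     = 1e-4
ADAM_BETAS        = (0.9, 0.999)
FRAMES_PER_UPDATE = 512
EPOCHS_PER_UPDATE = 10
MINI_BATCH_SIZE   = 64

# Minimal sketch of plugging these values into a PyTorch optimizer;
# `policy_net` is a stand-in for the actor-critic network used in the paper.
policy_net = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(policy_net.parameters(),
                             lr=LEARNING_RATE, betas=ADAM_BETAS)
```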