Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Policy Gradient Methods in the Presence of Symmetries and State Abstractions
Authors: Prakash Panangaden, Sahand Rezaei-Shoshtari, Rosie Zhao, David Meger, Doina Precup
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DHPG on continuous control tasks from DM Control on pixel observations. Importantly, to reliably evaluate our algorithm against the baselines and to correctly capture the distribution of results, we follow the best practices proposed by Agarwal et al. (2021) and report the interquartile mean (IQM) and performance profiles aggregated on all tasks over 10 random seeds. While our baseline results are obtained using the official code, when possible, some of the results may differ from the originally reported ones due to the difference in the seed numbers and our goal to present a faithful representation of the true performance distribution (Agarwal et al., 2021). |
| Researcher Affiliation | Collaboration | Prakash Panangaden EMAIL School of Computer Science, McGill University and Mila Quebec AI Institute Montreal, QC, Canada... Doina Precup EMAIL School of Computer Science, McGill University and Mila Quebec AI Institute and DeepMind Montreal, QC, Canada |
| Pseudocode | Yes | Algorithm 1 describes the pseudo-code of DHPG algorithms. Denoting pixel observations as o_t, the underlying states as s_t, and the abstract states as s̄_t, the main components of the DHPG algorithm are: the MDP homomorphism map h_{φ,η} = (f_φ(s_t), g_η(s_t, a_t)), pixel encoder E_µ(o_t), actual critic Q_ψ(s_t, a_t) and policy π_θ(a_t \| s_t), abstract critic Q̄_ψ̄(s̄_t, a_t) and policy π̄_θ̄(a_t \| s̄_t), reward predictor R̄_ρ(s̄_t), and probabilistic transition dynamics τ̄_ν(s̄_{t+1} \| s̄_t, a_t) which outputs a Gaussian distribution. Finally, we leverage target critic networks for a more stable training and use a vanilla replay buffer (Mnih et al., 2013; Lillicrap et al., 2015). Algorithm 1 Deep Homomorphic Policy Gradient (DHPG) |
| Open Source Code | Yes | Our code for DHPG and the novel environments with continuous symmetries are publicly available1. 1. https://github.com/sahandrez/homomorphic_policy_gradient |
| Open Datasets | Yes | Our code for DHPG and the novel environments with continuous symmetries are publicly available1. 1. https://github.com/sahandrez/homomorphic_policy_gradient... We demonstrate the effectiveness of our method on our environments, as well as on challenging visual control tasks from the DeepMind Control Suite. |
| Dataset Splits | No | The paper mentions evaluating on DM Control tasks with pixel observations and reports aggregated RLiable metrics over 14 tasks with 10 random seeds. However, it does not specify explicit training/validation/test dataset splits or methodologies for partitioning observations within these environments, as is typical for static datasets. In RL, the 'data' is generated through interaction, and reproducibility is often ensured by environment setup and random seeds for runs, not by dataset splits. |
| Hardware Specification | Yes | Our code is publicly available at https://github.com/sahandrez/homomorphic_policy_gradient. We implemented our method in PyTorch (Paszke et al., 2019) and results were obtained using Python v3.8.10, PyTorch v1.10.0, CUDA 11.4, and MuJoCo 2.1.1 (Todorov et al., 2012) on A100 GPUs on a cloud computing service. |
| Software Dependencies | Yes | Our code is publicly available at https://github.com/sahandrez/homomorphic_policy_gradient. We implemented our method in PyTorch (Paszke et al., 2019) and results were obtained using Python v3.8.10, PyTorch v1.10.0, CUDA 11.4, and MuJoCo 2.1.1 (Todorov et al., 2012) on A100 GPUs on a cloud computing service. |
| Experiment Setup | Yes | Table 1 presents the hyperparameters used in our experiments. The hyperparameters are all adapted from DrQ-v2 (Yarats et al., 2021a) without any further hyperparameter tuning. We have kept the same set of hyperparameters across all algorithms and tasks, except for the walker domain, for which, similarly to DrQ-v2 (Yarats et al., 2021a), we used an n-step return of n = 1 and a mini-batch size of 512. The core RL components (actor and critic networks), as well as the components of DHPG (state and action encoders, transition and reward models), are all MLP networks with the ReLU activation function and one hidden layer with a dimension of 256. The image encoder is based on the architecture of DrQ-v2, which is itself based on SAC-AE (Yarats et al., 2021b), and consists of four convolutional layers of 32 × 3 × 3 with ReLU as their activation functions, followed by a one-layer fully-connected neural network with layer normalization (Ba et al., 2016) and tanh activation function. The stride of the convolutional layers is 1, except for the first layer, which has stride 2. |
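The IQM aggregation cited in the Research Type row (Agarwal et al., 2021) can be sketched in a few lines. This is a simplified per-task version; the paper aggregates across all 14 tasks with stratified bootstrap confidence intervals (typically via the rliable library). The example returns below are illustrative, not taken from the paper.

```python
import numpy as np

def interquartile_mean(scores):
    """IQM: mean of the middle 50% of values, discarding the
    bottom and top quartiles (Agarwal et al., 2021)."""
    s = np.sort(np.asarray(scores, dtype=float))
    n = len(s)
    lo, hi = n // 4, n - n // 4  # indices bounding the middle half
    return s[lo:hi].mean()

# e.g. final episode returns from 10 seeds on one task (made-up numbers)
returns = [710, 745, 690, 802, 655, 770, 720, 735, 698, 760]
print(interquartile_mean(returns))
```

Unlike the plain mean, the IQM is robust to a small number of outlier seeds, which is why it is preferred for aggregating noisy RL runs.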
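The encoder geometry described in the Experiment Setup row can be checked with simple shape arithmetic. Assuming the standard 84 × 84 DM Control pixel observations used by DrQ-v2 (an assumption; the input resolution is not stated in the quote), four 3 × 3 convolutions with stride 2 then 1, 1, 1 and no padding give:

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    # standard output-size formula for a convolutional layer
    return (size + 2 * padding - kernel) // stride + 1

size = 84  # assumed input resolution (DrQ-v2 default)
for stride in [2, 1, 1, 1]:  # first conv has stride 2, the rest stride 1
    size = conv_out(size, stride=stride)

print(size)              # 35: spatial size after the four conv layers
print(32 * size * size)  # 39200: flattened features fed to the FC layer
```

The flattened 32 × 35 × 35 feature map is what the one-layer fully-connected network with layer normalization and tanh then projects down to the latent state.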