Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Policy Gradient Methods in the Presence of Symmetries and State Abstractions
Authors: Prakash Panangaden, Sahand Rezaei-Shoshtari, Rosie Zhao, David Meger, Doina Precup
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DHPG on continuous control tasks from DM Control on pixel observations. Importantly, to reliably evaluate our algorithm against the baselines and to correctly capture the distribution of results, we follow the best practices proposed by Agarwal et al. (2021) and report the interquartile mean (IQM) and performance profiles aggregated on all tasks over 10 random seeds. While our baseline results are obtained using the official code, when possible, some of the results may differ from the originally reported ones due to the difference in the seed numbers and our goal to present a faithful representation of the true performance distribution (Agarwal et al., 2021). |
| Researcher Affiliation | Collaboration | Prakash Panangaden EMAIL School of Computer Science, McGill University and Mila Quebec AI Institute Montreal, QC, Canada... Doina Precup EMAIL School of Computer Science, McGill University and Mila Quebec AI Institute and DeepMind Montreal, QC, Canada |
| Pseudocode | Yes | Algorithm 1 describes the pseudo-code of DHPG algorithms. Denoting pixel observations as o_t, the underlying states as s_t, and the abstract states as s̄_t, the main components of the DHPG algorithm are: the MDP homomorphism map h_{φ,η} = (f_φ(s_t), g_η(s_t, a_t)), pixel encoder E_µ(o_t), actual critic Q_ψ(s_t, a_t) and policy π_θ(a_t \| s_t), abstract critic Q̄_ψ̄(s̄_t, a_t) and policy π̄_θ̄(a_t \| s̄_t), reward predictor R̄_ρ(s̄_t), and probabilistic transition dynamics τ̄_ν(s̄_{t+1} \| s̄_t, a_t) which outputs a Gaussian distribution. Finally, we leverage target critic networks for a more stable training and use a vanilla replay buffer (Mnih et al., 2013; Lillicrap et al., 2015). Algorithm 1 Deep Homomorphic Policy Gradient (DHPG) |
| Open Source Code | Yes | Our code for DHPG and the novel environments with continuous symmetries are publicly available1. 1. https://github.com/sahandrez/homomorphic_policy_gradient |
| Open Datasets | Yes | Our code for DHPG and the novel environments with continuous symmetries are publicly available1. 1. https://github.com/sahandrez/homomorphic_policy_gradient... We demonstrate the effectiveness of our method on our environments, as well as on challenging visual control tasks from the DeepMind Control Suite. |
| Dataset Splits | No | The paper mentions evaluating on DM Control tasks with pixel observations and reports aggregated RLiable metrics over 14 tasks with 10 random seeds. However, it does not specify explicit training/validation/test dataset splits or methodologies for partitioning observations within these environments, as is typical for static datasets. In RL, the 'data' is generated through interaction, and reproducibility is often ensured by environment setup and random seeds for runs, not by dataset splits. |
| Hardware Specification | Yes | Our code is publicly available at https://github.com/sahandrez/homomorphic_policy_gradient. We implemented our method in PyTorch (Paszke et al., 2019) and results were obtained using Python v3.8.10, PyTorch v1.10.0, CUDA 11.4, and MuJoCo 2.1.1 (Todorov et al., 2012) on A100 GPUs on a cloud computing service. |
| Software Dependencies | Yes | Our code is publicly available at https://github.com/sahandrez/homomorphic_policy_gradient. We implemented our method in PyTorch (Paszke et al., 2019) and results were obtained using Python v3.8.10, PyTorch v1.10.0, CUDA 11.4, and MuJoCo 2.1.1 (Todorov et al., 2012) on A100 GPUs on a cloud computing service. |
| Experiment Setup | Yes | Table 1 presents the hyperparameters used in our experiments. The hyperparameters are all adapted from DrQ-v2 (Yarats et al., 2021a) without any further hyperparameter tuning. We have kept the same set of hyperparameters across all algorithms and tasks, except for the walker domain, for which, similarly to DrQ-v2 (Yarats et al., 2021a), we used an n-step return of n = 1 and a mini-batch size of 512. The core RL components (actor and critic networks), as well as the components of DHPG (state and action encoders, transition and reward models), are all MLP networks with the ReLU activation function and one hidden layer with a dimension of 256. The image encoder is based on the architecture of DrQ-v2, which is itself based on SAC-AE (Yarats et al., 2021b), and consists of four convolutional layers of 32 × 3 × 3 with ReLU as their activation functions, followed by a one-layer fully-connected neural network with layer normalization (Ba et al., 2016) and tanh activation function. The stride of the convolutional layers is 1, except for the first layer, which has stride 2. |
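The IQM aggregation cited in the Research Type row (Agarwal et al., 2021) can be sketched in a few lines. This is a simplified per-task version; the paper aggregates across all 14 tasks with stratified bootstrap confidence intervals (typically via the rliable library). The example returns below are illustrative, not taken from the paper.

```python
import numpy as np

def interquartile_mean(scores):
    """IQM: mean of the middle 50% of values, discarding the
    bottom and top quartiles (Agarwal et al., 2021)."""
    s = np.sort(np.asarray(scores, dtype=float))
    n = len(s)
    lo, hi = n // 4, n - n // 4  # indices bounding the middle half
    return s[lo:hi].mean()

# e.g. final episode returns from 10 seeds on one task (made-up numbers)
returns = [710, 745, 690, 802, 655, 770, 720, 735, 698, 760]
print(interquartile_mean(returns))
```

Unlike the plain mean, the IQM is robust to a small number of outlier seeds, which is why it is preferred for aggregating noisy RL runs.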
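The encoder geometry described in the Experiment Setup row can be checked with simple shape arithmetic. Assuming the standard 84 × 84 DM Control pixel observations used by DrQ-v2 (an assumption; the input resolution is not stated in the quote), four 3 × 3 convolutions with stride 2 then 1, 1, 1 and no padding give:

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    # standard output-size formula for a convolutional layer
    return (size + 2 * padding - kernel) // stride + 1

size = 84  # assumed input resolution (DrQ-v2 default)
for stride in [2, 1, 1, 1]:  # first conv has stride 2, the rest stride 1
    size = conv_out(size, stride=stride)

print(size)              # 35: spatial size after the four conv layers
print(32 * size * size)  # 39200: flattened features fed to the FC layer
```

The flattened 32 × 35 × 35 feature map is what the one-layer fully-connected network with layer normalization and tanh then projects down to the latent state.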