Continuous MDP Homomorphisms and Homomorphic Policy Gradient

Authors: Sahand Rezaei-Shoshtari, Rosie Zhao, Prakash Panangaden, David Meger, Doina Precup

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DHPG on continuous control tasks from DM Control on state and pixel observations. Importantly, to reliably evaluate our algorithm against the baselines and to correctly capture the distribution of results, we follow the best practices proposed by Agarwal et al. [5] and report the interquartile mean (IQM) and performance profiles aggregated on all tasks over 10 random seeds.
Researcher Affiliation | Collaboration | Sahand Rezaei-Shoshtari (McGill University and Mila); Rosie Zhao (McGill University and Mila); Prakash Panangaden (McGill University and Mila); David Meger (McGill University and Mila); Doina Precup (McGill University, Mila, and DeepMind)
Pseudocode | Yes | The pseudo-code of DHPG is presented in Appendix E.1.
Open Source Code | Yes | Our code is publicly available at https://github.com/sahandrez/homomorphic_policy_gradient.
Open Datasets | Yes | We evaluate DHPG on continuous control tasks from DM Control on state and pixel observations. We use the DeepMind Control Suite [77] for continuous control tasks.
Dataset Splits | No | The paper does not explicitly specify validation dataset splits. It mentions using a replay buffer for training but no clear split for validation.
Hardware Specification | Yes | All experiments were conducted on a single NVIDIA Quadro RTX 6000 GPU.
Software Dependencies | No | The paper mentions software like PyTorch, JAX, Haiku, Optax, and the DeepMind Control Suite, but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Training the Policy and Critic. Actual and abstract critics are trained using n-step TD error for faster reward propagation [7]. The loss function for each critic is therefore defined as the expectation of the n-step Bellman error estimated over transition samples from the replay buffer $\mathcal{B}$:
$\mathcal{L}_{\text{actual critic}}(\psi) = \mathbb{E}_{(s,a,s',r) \sim \mathcal{B}}\big[\big(R_t^{(n)} + \gamma^n Q_{\psi'}(s_{t+n}, a_{t+n}) - Q_{\psi}(s_t, a_t)\big)^2\big]$ (9)
$\mathcal{L}_{\text{abstract critic}}(\bar{\psi}, \phi, \eta) = \mathbb{E}_{(s,a,s',r) \sim \mathcal{B}}\big[\big(R_t^{(n)} + \gamma^n \bar{Q}_{\bar{\psi}'}(\bar{s}_{t+n}, \bar{a}_{t+n}) - \bar{Q}_{\bar{\psi}}(\bar{s}_t, \bar{a}_t)\big)^2\big]$ (10)
where $\bar{s}_t = f_{\phi}(s_t)$ and $\bar{a}_t = g_{\eta}(s_t, a_t)$ are computed using the learned MDP homomorphism, $\psi'$ and $\bar{\psi}'$ denote parameters of the target networks, and $R_t^{(n)} = \sum_{i=0}^{n-1} \gamma^i r_{t+i}$ is the n-step return. Consequently, we train the policy using DPG [70] and HPG from Theorem 5 by backpropagating the following loss:
$\mathcal{L}_{\text{actor}}(\theta) = -\mathbb{E}_{s \sim \mathcal{B}}\big[Q_{\psi}(s, \pi_{\theta}(s)) + \bar{Q}_{\bar{\psi}}(f_{\phi}(s), g_{\eta}(s, \pi_{\theta}(s)))\big]$ (11)
Full hyperparameters are provided in Appendix E.2.
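
The Experiment Setup row above quotes the critic and actor losses, Eqs. (9)-(11). The following is a schematic PyTorch sketch of how those losses could be assembled; every name here (actor, critic, abstract_critic, f_phi, g_eta, the batch layout, and the precomputed n-step return) is an illustrative placeholder rather than the interface of the released repository.

```python
# Schematic PyTorch sketch of Eqs. (9)-(11); all names and interfaces are
# illustrative placeholders, not the authors' released implementation.
import torch

def dhpg_losses(batch, actor, critic, critic_target,
                abstract_critic, abstract_critic_target,
                f_phi, g_eta, gamma: float, n: int):
    # s_t, a_t, s_{t+n}, a_{t+n}, and the n-step return R_t^{(n)};
    # a_{t+n} would typically be produced by a target policy at s_{t+n}.
    s, a, s_n, a_n, R_n = batch

    # Eq. (9): actual critic, n-step TD target computed with a target network.
    with torch.no_grad():
        target = R_n + gamma ** n * critic_target(s_n, a_n)
    actual_critic_loss = (target - critic(s, a)).pow(2).mean()

    # Eq. (10): abstract critic evaluated on the image of the learned MDP
    # homomorphism, s_bar = f_phi(s) and a_bar = g_eta(s, a).
    with torch.no_grad():
        target_bar = R_n + gamma ** n * abstract_critic_target(
            f_phi(s_n), g_eta(s_n, a_n))
    abstract_critic_loss = (
        target_bar - abstract_critic(f_phi(s), g_eta(s, a))).pow(2).mean()

    # Eq. (11): actor loss combining DPG (actual critic) and HPG (abstract critic).
    pi_s = actor(s)
    actor_loss = -(critic(s, pi_s)
                   + abstract_critic(f_phi(s), g_eta(s, pi_s))).mean()

    return actual_critic_loss, abstract_critic_loss, actor_loss
```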
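
The Research Type row above reports results as the interquartile mean (IQM) over 10 seeds, following Agarwal et al. [5]. As a minimal sketch of what that aggregate computes (not the paper's evaluation code), the IQM is a 25%-trimmed mean over the pooled per-run, per-task scores:

```python
# Minimal IQM sketch: a 25%-trimmed mean over pooled run scores,
# as recommended by Agarwal et al. [5]. Scores below are synthetic.
import numpy as np
from scipy import stats

def iqm(scores: np.ndarray) -> float:
    """Interquartile mean over a flat array of per-run, per-task scores."""
    return stats.trim_mean(scores.ravel(), proportiontocut=0.25)

# Hypothetical example: scores[task, seed] for 2 tasks x 10 seeds.
rng = np.random.default_rng(0)
scores = rng.uniform(0, 1000, size=(2, 10))
print(f"IQM return: {iqm(scores):.1f}")
```

Agarwal et al. additionally recommend bootstrap confidence intervals and performance profiles, which the quoted excerpt says the paper also reports.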
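
The Open Datasets row above refers to the DeepMind Control Suite [77] rather than a fixed dataset. Below is a minimal sketch of standard dm_control usage; the specific task and the random placeholder policy are assumptions for illustration, not the paper's experimental configuration.

```python
# Minimal sketch of standard dm_control usage (not the authors' training code).
# The task choice is illustrative; DHPG is evaluated on several DM Control tasks.
import numpy as np
from dm_control import suite

env = suite.load(domain_name="cartpole", task_name="swingup")
action_spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    # Random policy as a stand-in for the learned actor.
    action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)
```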