Continuous MDP Homomorphisms and Homomorphic Policy Gradient

Authors: Sahand Rezaei-Shoshtari, Rosie Zhao, Prakash Panangaden, David Meger, Doina Precup

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DHPG on continuous control tasks from DM Control on state and pixel observations. Importantly, to reliably evaluate our algorithm against the baselines and to correctly capture the distribution of results, we follow the best practices proposed by Agarwal et al. [5] and report the interquartile mean (IQM) and performance profiles aggregated on all tasks over 10 random seeds.
Researcher Affiliation | Collaboration | Sahand Rezaei-Shoshtari (McGill University and Mila); Rosie Zhao (McGill University and Mila); Prakash Panangaden (McGill University and Mila); David Meger (McGill University and Mila); Doina Precup (McGill University, Mila, and DeepMind)
Pseudocode | Yes | The pseudo-code of DHPG is presented in Appendix E.1.
Open Source Code | Yes | Our code is publicly available at https://github.com/sahandrez/homomorphic_policy_gradient.
Open Datasets | Yes | We evaluate DHPG on continuous control tasks from DM Control on state and pixel observations. We use the DeepMind Control Suite [77] for continuous control tasks.
Dataset Splits | No | The paper does not explicitly specify validation dataset splits. It mentions using a replay buffer for training but no clear split for validation.
Hardware Specification | Yes | All experiments were conducted on a single NVIDIA Quadro RTX 6000 GPU.
Software Dependencies | No | The paper mentions software like PyTorch, JAX, Haiku, Optax, and the DeepMind Control Suite, but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Training the Policy and Critic. Actual and abstract critics are trained using n-step TD error for faster reward propagation [7]. The loss function for each critic is therefore defined as the expectation of the n-step Bellman error estimated over transition samples from the replay buffer $\mathcal{B}$:
$\mathcal{L}_{\text{actual critic}}(\psi) = \mathbb{E}_{(s,a,s',r) \sim \mathcal{B}}\big[\big(R_t^{(n)} + \gamma^n Q_{\psi'}(s_{t+n}, a_{t+n}) - Q_{\psi}(s_t, a_t)\big)^2\big]$ (9)
$\mathcal{L}_{\text{abstract critic}}(\bar{\psi}, \phi, \eta) = \mathbb{E}_{(s,a,s',r) \sim \mathcal{B}}\big[\big(R_t^{(n)} + \gamma^n \bar{Q}_{\bar{\psi}'}(\bar{s}_{t+n}, \bar{a}_{t+n}) - \bar{Q}_{\bar{\psi}}(\bar{s}_t, \bar{a}_t)\big)^2\big]$ (10)
where $\bar{s}_t = f_{\phi}(s_t)$ and $\bar{a}_t = g_{\eta}(s_t, a_t)$ are computed using the learned MDP homomorphism, $\psi'$ and $\bar{\psi}'$ denote parameters of the target networks, and $R_t^{(n)} = \sum_{i=0}^{n-1} \gamma^i r_{t+i}$ is the n-step return. Consequently, we train the policy using DPG [70] and HPG from Theorem 5 by backpropagating the following loss:
$\mathcal{L}_{\text{actor}}(\theta) = -\mathbb{E}_{s \sim \mathcal{B}}\big[Q_{\psi}(s, \pi_{\theta}(s)) + \bar{Q}_{\bar{\psi}}(f_{\phi}(s), g_{\eta}(s, \pi_{\theta}(s)))\big]$ (11)
Full hyperparameters are provided in Appendix E.2.
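
The Experiment Setup row above quotes the critic and actor losses, Eqs. (9)-(11). The following is a schematic PyTorch sketch of how those losses could be assembled; every name here (actor, critic, abstract_critic, f_phi, g_eta, the batch layout, and the precomputed n-step return) is an illustrative placeholder rather than the interface of the released repository.

```python
# Schematic PyTorch sketch of Eqs. (9)-(11); all names and interfaces are
# illustrative placeholders, not the authors' released implementation.
import torch

def dhpg_losses(batch, actor, critic, critic_target,
                abstract_critic, abstract_critic_target,
                f_phi, g_eta, gamma: float, n: int):
    # s_t, a_t, s_{t+n}, a_{t+n}, and the n-step return R_t^{(n)};
    # a_{t+n} would typically be produced by a target policy at s_{t+n}.
    s, a, s_n, a_n, R_n = batch

    # Eq. (9): actual critic, n-step TD target computed with a target network.
    with torch.no_grad():
        target = R_n + gamma ** n * critic_target(s_n, a_n)
    actual_critic_loss = (target - critic(s, a)).pow(2).mean()

    # Eq. (10): abstract critic evaluated on the image of the learned MDP
    # homomorphism, s_bar = f_phi(s) and a_bar = g_eta(s, a).
    with torch.no_grad():
        target_bar = R_n + gamma ** n * abstract_critic_target(
            f_phi(s_n), g_eta(s_n, a_n))
    abstract_critic_loss = (
        target_bar - abstract_critic(f_phi(s), g_eta(s, a))).pow(2).mean()

    # Eq. (11): actor loss combining DPG (actual critic) and HPG (abstract critic).
    pi_s = actor(s)
    actor_loss = -(critic(s, pi_s)
                   + abstract_critic(f_phi(s), g_eta(s, pi_s))).mean()

    return actual_critic_loss, abstract_critic_loss, actor_loss
```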
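
The Research Type row above reports results as the interquartile mean (IQM) over 10 seeds, following Agarwal et al. [5]. As a minimal sketch of what that aggregate computes (not the paper's evaluation code), the IQM is a 25%-trimmed mean over the pooled per-run, per-task scores:

```python
# Minimal IQM sketch: a 25%-trimmed mean over pooled run scores,
# as recommended by Agarwal et al. [5]. Scores below are synthetic.
import numpy as np
from scipy import stats

def iqm(scores: np.ndarray) -> float:
    """Interquartile mean over a flat array of per-run, per-task scores."""
    return stats.trim_mean(scores.ravel(), proportiontocut=0.25)

# Hypothetical example: scores[task, seed] for 2 tasks x 10 seeds.
rng = np.random.default_rng(0)
scores = rng.uniform(0, 1000, size=(2, 10))
print(f"IQM return: {iqm(scores):.1f}")
```

Agarwal et al. additionally recommend bootstrap confidence intervals and performance profiles, which the quoted excerpt says the paper also reports.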
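
The Open Datasets row above refers to the DeepMind Control Suite [77] rather than a fixed dataset. Below is a minimal sketch of standard dm_control usage; the specific task and the random placeholder policy are assumptions for illustration, not the paper's experimental configuration.

```python
# Minimal sketch of standard dm_control usage (not the authors' training code).
# The task choice is illustrative; DHPG is evaluated on several DM Control tasks.
import numpy as np
from dm_control import suite

env = suite.load(domain_name="cartpole", task_name="swingup")
action_spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    # Random policy as a stand-in for the learned actor.
    action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)
```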