Continuous MDP Homomorphisms and Homomorphic Policy Gradient
Authors: Sahand Rezaei-Shoshtari, Rosie Zhao, Prakash Panangaden, David Meger, Doina Precup
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DHPG on continuous control tasks from DM Control on state and pixel observations. Importantly, to reliably evaluate our algorithm against the baselines and to correctly capture the distribution of results, we follow the best practices proposed by Agarwal et al. [5] and report the interquartile mean (IQM) and performance profiles aggregated on all tasks over 10 random seeds. |
| Researcher Affiliation | Collaboration | Sahand Rezaei-Shoshtari (McGill University and Mila), Rosie Zhao (McGill University and Mila), Prakash Panangaden (McGill University and Mila), David Meger (McGill University and Mila), Doina Precup (McGill University, Mila, and DeepMind) |
| Pseudocode | Yes | The pseudo-code of DHPG is presented in Appendix E.1. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/sahandrez/homomorphic_policy_gradient. |
| Open Datasets | Yes | We evaluate DHPG on continuous control tasks from DM Control on state and pixel observations. We use the DeepMind Control Suite [77] for continuous control tasks. |
| Dataset Splits | No | The paper does not explicitly specify validation dataset splits. It mentions using a replay buffer for training but does not describe a separate validation split. |
| Hardware Specification | Yes | All experiments were conducted on a single NVIDIA Quadro RTX 6000 GPU. |
| Software Dependencies | No | The paper mentions software like PyTorch, JAX, Haiku, Optax, and DM Control Suite, but it does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | Training the Policy and Critic. Actual and abstract critics are trained using the n-step TD error for faster reward propagation [7]. The loss function for each critic is therefore defined as the expectation of the n-step Bellman error estimated over transitions sampled from the replay buffer $\mathcal{B}$: $\mathcal{L}_{\text{actual critic}}(\psi) = \mathbb{E}_{(s,a,s',r)\sim\mathcal{B}}\big[\big(R_t^{(n)} + \gamma^n Q_{\psi'}(s_{t+n}, a_{t+n}) - Q_{\psi}(s_t, a_t)\big)^2\big]$ (9) and $\mathcal{L}_{\text{abstract critic}}(\bar{\psi}, \phi, \eta) = \mathbb{E}_{(s,a,s',r)\sim\mathcal{B}}\big[\big(R_t^{(n)} + \gamma^n \bar{Q}_{\bar{\psi}'}(\bar{s}_{t+n}, \bar{a}_{t+n}) - \bar{Q}_{\bar{\psi}}(\bar{s}_t, \bar{a}_t)\big)^2\big]$ (10), where $\bar{s}_t = f_\phi(s_t)$ and $\bar{a}_t = g_\eta(s_t, a_t)$ are computed using the learned MDP homomorphism, $\psi'$ and $\bar{\psi}'$ denote parameters of the target networks, and $R_t^{(n)} = \sum_{i=0}^{n-1} \gamma^i r_{t+i}$ is the n-step return. Consequently, we train the policy using DPG [70] and HPG from Theorem 5 by backpropagating the following loss: $\mathcal{L}_{\text{actor}}(\theta) = -\mathbb{E}_{s\sim\mathcal{B}}\big[Q_{\psi}(s, \pi_\theta(s)) + \bar{Q}_{\bar{\psi}}(f_\phi(s), g_\eta(s, \pi_\theta(s)))\big]$ (11). Full hyperparameters are provided in Appendix E.2. |
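
The Experiment Setup row quotes the n-step TD critic losses (Eqs. 9 and 10) and the combined DPG/HPG actor loss (Eq. 11). The sketch below is an illustrative PyTorch rendering of those losses, not the authors' released implementation; the callable names (`critic`, `abstract_critic`, `actor`, `f_phi`, `g_eta`) and the batch layout are assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def critic_losses(batch, critic, critic_target, abstract_critic, abstract_critic_target,
                  actor, f_phi, g_eta, gamma, n):
    """Sketch of the n-step TD losses for the actual and abstract critics (Eqs. 9-10).

    `batch` is assumed to hold n-step transitions from the replay buffer:
    states s_t, actions a_t, precomputed n-step returns R_t^(n), and the
    bootstrap states s_{t+n}. All callables are hypothetical placeholders.
    """
    s_t, a_t, n_step_return, s_tn = batch

    with torch.no_grad():
        a_tn = actor(s_tn)  # a_{t+n} chosen by the current policy
        # Eq. 9 target: R_t^(n) + gamma^n * Q_{psi'}(s_{t+n}, a_{t+n})
        target_q = n_step_return + gamma ** n * critic_target(s_tn, a_tn)

        # Abstract quantities via the learned homomorphism: s_bar = f_phi(s), a_bar = g_eta(s, a)
        s_bar_tn, a_bar_tn = f_phi(s_tn), g_eta(s_tn, a_tn)
        # Eq. 10 target: R_t^(n) + gamma^n * Q_bar_{psi_bar'}(s_bar_{t+n}, a_bar_{t+n})
        target_q_bar = n_step_return + gamma ** n * abstract_critic_target(s_bar_tn, a_bar_tn)

    loss_actual = F.mse_loss(critic(s_t, a_t), target_q)                         # Eq. 9
    s_bar_t, a_bar_t = f_phi(s_t), g_eta(s_t, a_t)                               # gradients reach phi, eta
    loss_abstract = F.mse_loss(abstract_critic(s_bar_t, a_bar_t), target_q_bar)  # Eq. 10
    return loss_actual, loss_abstract


def actor_loss(states, actor, critic, abstract_critic, f_phi, g_eta):
    """Sketch of the combined DPG + HPG actor loss (Eq. 11), minimized by gradient descent."""
    a = actor(states)
    q_actual = critic(states, a)                                   # DPG term
    q_abstract = abstract_critic(f_phi(states), g_eta(states, a))  # HPG term through the abstract MDP
    return -(q_actual + q_abstract).mean()
```

Minimizing `actor_loss` ascends both the actual and the abstract action-value estimates, mirroring the sum inside the expectation in Eq. 11.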
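
The Research Type row notes that results are aggregated as the interquartile mean (IQM) over all tasks and 10 random seeds, following Agarwal et al. [5]. Below is a minimal sketch of that aggregation step, assuming a hypothetical score array of shape (seeds, tasks); the exact tooling used by the authors is not specified in this table.

```python
import numpy as np
from scipy.stats import trim_mean


def interquartile_mean(scores: np.ndarray) -> float:
    """Interquartile mean (IQM): mean of the middle 50% of pooled scores.

    `scores` is a hypothetical (num_seeds, num_tasks) array of normalized
    returns; all runs are pooled before trimming the top and bottom 25%,
    as recommended by Agarwal et al. [5].
    """
    return float(trim_mean(scores.reshape(-1), proportiontocut=0.25))


# Example with made-up scores for 10 seeds x 5 tasks.
rng = np.random.default_rng(seed=0)
scores = rng.uniform(0.0, 1.0, size=(10, 5))
print(f"IQM over all runs and tasks: {interquartile_mean(scores):.3f}")
```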