Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Continuous MDP Homomorphisms and Homomorphic Policy Gradient
Authors: Sahand Rezaei-Shoshtari, Rosie Zhao, Prakash Panangaden, David Meger, Doina Precup
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DHPG on continuous control tasks from DM Control on state and pixel observations. Importantly, to reliably evaluate our algorithm against the baselines and to correctly capture the distribution of results, we follow the best practices proposed by Agarwal et al. [5] and report the interquartile mean (IQM) and performance profiles aggregated on all tasks over 10 random seeds. |
| Researcher Affiliation | Collaboration | Sahand Rezaei-Shoshtari (McGill University and Mila); Rosie Zhao (McGill University and Mila); Prakash Panangaden (McGill University and Mila); David Meger (McGill University and Mila); Doina Precup (McGill University, Mila, and DeepMind) |
| Pseudocode | Yes | The pseudo-code of DHPG is presented in Appendix E.1. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/sahandrez/homomorphic_policy_gradient. |
| Open Datasets | Yes | We evaluate DHPG on continuous control tasks from DM Control on state and pixel observations. We use the DeepMind Control Suite [77] for continuous control tasks. |
| Dataset Splits | No | The paper does not explicitly specify validation dataset splits. It mentions using a replay buffer for training but describes no separate validation split. |
| Hardware Specification | Yes | All experiments were conducted on a single NVIDIA Quadro RTX 6000 GPU. |
| Software Dependencies | No | The paper mentions software like PyTorch, JAX, Haiku, Optax, and DM Control Suite, but it does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | Training the Policy and Critic. Actual and abstract critics are trained using the n-step TD error for faster reward propagation [7]. The loss function for each critic is therefore defined as the expectation of the n-step Bellman error estimated over transition samples from the replay buffer $\mathcal{B}$: $\mathcal{L}_{\text{actual critic}}(\psi) = \mathbb{E}_{(s,a,s',r)\sim\mathcal{B}}\big[\big(R_t^{(n)} + \gamma^n Q_{\psi'}(s_{t+n}, a_{t+n}) - Q_{\psi}(s_t, a_t)\big)^2\big]$ (9), $\mathcal{L}_{\text{abstract critic}}(\bar\psi,\varphi,\eta) = \mathbb{E}_{(s,a,s',r)\sim\mathcal{B}}\big[\big(R_t^{(n)} + \gamma^n \bar{Q}_{\bar\psi'}(\bar{s}_{t+n}, \bar{a}_{t+n}) - \bar{Q}_{\bar\psi}(\bar{s}_t, \bar{a}_t)\big)^2\big]$ (10), where $\bar{s}_t = f_\varphi(s_t)$ and $\bar{a}_t = g_\eta(s_t, a_t)$ are computed using the learned MDP homomorphism, $\psi'$ and $\bar\psi'$ denote parameters of the target networks, and $R_t^{(n)} = \sum_{i=0}^{n-1} \gamma^i r_{t+i}$ is the n-step return. Consequently, we train the policy using DPG [70] and HPG from Theorem 5 by backpropagating the following loss: $\mathcal{L}_{\text{actor}}(\theta) = -\mathbb{E}_{s\sim\mathcal{B}}\big[Q_{\psi}(s, \pi_\theta(s)) + \bar{Q}_{\bar\psi}(f_\varphi(s), g_\eta(s, \pi_\theta(s)))\big]$ (11). Full hyperparameters are provided in Appendix E.2. |
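The Experiment Setup row quotes the n-step TD targets of Eqs. (9)–(10). As a minimal sketch (not the paper's implementation; function names and the scalar bootstrap value are illustrative), the n-step return $R_t^{(n)}$ and the resulting TD target can be computed as:

```python
def n_step_return(rewards, gamma, n):
    """R_t^(n) = sum_{i=0}^{n-1} gamma^i * r_{t+i}, for t = 0."""
    return sum(gamma ** i, * (r,)) if False else sum(
        gamma ** i * r for i, r in enumerate(rewards[:n])
    )

def td_target(rewards, gamma, n, q_bootstrap):
    """n-step TD target: R_t^(n) + gamma^n * Q(s_{t+n}, a_{t+n}).

    q_bootstrap stands in for the target-network value
    Q_{psi'}(s_{t+n}, a_{t+n}) in Eq. (9).
    """
    return n_step_return(rewards, gamma, n) + gamma ** n * q_bootstrap

# Example: three rewards of 1 with gamma = 0.5 and a bootstrap value of 2.
print(n_step_return([1.0, 1.0, 1.0], 0.5, 3))   # → 1.75
print(td_target([1.0, 1.0, 1.0], 0.5, 3, 2.0))  # → 2.0
```

The squared difference between this target and the current critic's estimate gives the Bellman error minimized in Eqs. (9)–(10); in practice it is averaged over minibatches sampled from the replay buffer.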
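The Research Type row quotes the paper's use of the interquartile mean (IQM) from Agarwal et al. [5] to aggregate results over 10 seeds. A minimal sketch of the IQM (the per-seed scores below are made-up numbers, not results from the paper):

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: the mean of the middle 50% of sorted values,
    discarding the bottom and top quartiles. More robust to outlier
    seeds than the plain mean, less wasteful than the median."""
    s = np.sort(np.asarray(scores, dtype=float))
    n = len(s)
    k = n // 4  # number of values trimmed from each end
    return s[k:n - k].mean()

# 10 hypothetical per-seed returns, including two outlier runs.
runs = [100, 102, 98, 97, 103, 101, 99, 250, 5, 100]
print(iqm(runs))       # → 100.0 (outliers 5 and 250 are trimmed)
print(np.mean(runs))   # plain mean is pulled toward the outliers
```

This robustness is why IQM is recommended for deep RL benchmarks, where a small number of seeds can fail or diverge.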