Piecewise Linear Parametrization of Policies: Towards Interpretable Deep Reinforcement Learning
Authors: Maxime Wabartha, Joelle Pineau
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate HC policies in control and navigation experiments, visualize the improved interpretability of the agent and highlight its trade-off with performance. Moreover, we validate that the restricted model class that the HyperCombinator belongs to is compatible with the algorithmic constraints of various reinforcement learning algorithms. |
| Researcher Affiliation | Collaboration | Maxime Wabartha McGill University, Mila Joelle Pineau McGill University, Mila, FAIR at Meta |
| Pseudocode | Yes | Algorithm 1 SAC (with HyperCombinator actor) ... Algorithm 2 Update Actor And Alpha (a hedged sketch of such an actor and its update appears after the table) |
| Open Source Code | No | The paper states 'We base ourselves on an open-source PyTorch implementation of SAC (Yarats & Kostrikov, 2020)' and 'We base our experiments on the open-source code provided by RIS (Chane-Sane et al., 2021)'. These refer to third-party baseline implementations, not an explicit release of the HyperCombinator code developed in this paper. |
| Open Datasets | Yes | We evaluate how well HC policies can control proprioceptive variables such as the joints of a robot through the DeepMind Control Suite benchmark (Tassa et al., 2018). |
| Dataset Splits | No | The paper describes evaluation procedures like 'We evaluate the agent every 10000 timesteps by rolling it out for 10 episodes and taking the average return' and 'Every 10000 steps, we roll out 5 evaluation episodes' (a sketch of this rollout-averaging loop appears after the table). However, it does not explicitly provide traditional training/test/validation dataset splits; this is typical of reinforcement learning research, which relies on environment interaction rather than static datasets. |
| Hardware Specification | Yes | All the GPUs were NVIDIA Tesla V100, with 16GB memory available. The CPUs were Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz. Each seed was allocated 1 GPU, 10 CPUs, and 64GB of RAM. |
| Software Dependencies | No | The paper mentions software such as 'Python (Van Rossum & Drake Jr, 1995)', 'numpy (Van Der Walt et al., 2011)', 'matplotlib (Hunter, 2007)', and 'PyTorch (Paszke et al., 2017)' in the acknowledgements, but only with citation years. It does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.9' or 'Python 3.8'). |
| Experiment Setup | Yes | Table 3: Full list of hyperparameters in the control experiments. Includes: Action repeat 1, Discount factor 0.99, Learnable α True, Initial α 0.1, α learning rate λα 1e-4, Actor learning rate λπ 1e-4, Actor update frequency 1, Critic architecture [1024, 1024], Critic learning rate λQ 1e-4, Batch size 1024, log σmin -5, log σmax 2, Gumbel net architecture [1024, 1024, 1024], Sub-policy assignation entropy coefficient λassig 0.001, Gumbel temperature 1 (collected into a config dict after the table). |
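
The pseudocode row references Algorithm 1 (SAC with a HyperCombinator actor) and Algorithm 2 (Update Actor And Alpha). As a reading aid, here is a minimal PyTorch sketch of a piecewise-linear actor of that flavour and a SAC-style actor/temperature update it could plug into. All names (`HyperCombinatorActorSketch`, `num_subpolicies`, `critic`, etc.), the way the Gumbel-softmax gate combines the linear sub-policies, and the sign and placement of the assignation-entropy term are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyperCombinatorActorSketch(nn.Module):
    """K linear sub-policies combined by a Gumbel-softmax gating network
    (hypothetical reconstruction; hyperparameter names follow Table 3)."""

    def __init__(self, obs_dim, act_dim, num_subpolicies=8,
                 gumbel_hidden=(1024, 1024, 1024), gumbel_tau=1.0,
                 log_std_min=-5.0, log_std_max=2.0):
        super().__init__()
        gate_layers, in_dim = [], obs_dim
        for h in gumbel_hidden:                          # "Gumbel net architecture"
            gate_layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        gate_layers.append(nn.Linear(in_dim, num_subpolicies))
        self.gate = nn.Sequential(*gate_layers)
        # Each sub-policy is linear in the observation: mean_k = W_k s + b_k.
        self.sub_means = nn.Linear(obs_dim, act_dim * num_subpolicies)
        self.sub_log_std = nn.Parameter(torch.zeros(num_subpolicies, act_dim))
        self.K, self.act_dim, self.tau = num_subpolicies, act_dim, gumbel_tau
        self.log_std_min, self.log_std_max = log_std_min, log_std_max

    def forward(self, obs):
        # Differentiable (soft) sub-policy assignation via Gumbel-softmax.
        assign = F.gumbel_softmax(self.gate(obs), tau=self.tau, hard=False)
        means = self.sub_means(obs).view(-1, self.K, self.act_dim)
        mean = (assign.unsqueeze(-1) * means).sum(dim=1)
        log_std = (assign @ self.sub_log_std).clamp(self.log_std_min, self.log_std_max)
        dist = torch.distributions.Normal(mean, log_std.exp())
        u = dist.rsample()
        action = torch.tanh(u)                           # squashed Gaussian, SAC-style
        log_prob = dist.log_prob(u).sum(-1)
        log_prob -= torch.log(1.0 - action.pow(2) + 1e-6).sum(-1)
        assign_entropy = -(assign * (assign + 1e-8).log()).sum(-1)
        return action, log_prob, assign_entropy


def update_actor_and_alpha_sketch(actor, critic, log_alpha, obs, actor_opt,
                                  alpha_opt, target_entropy, assig_coef=1e-3):
    """SAC-style actor and temperature update; how lambda_assig enters the
    actor loss (and with which sign) is a guess, not the paper's Algorithm 2."""
    action, log_prob, assign_entropy = actor(obs)
    alpha = log_alpha.exp().detach()
    actor_loss = (alpha * log_prob - critic(obs, action)).mean()
    actor_loss = actor_loss - assig_coef * assign_entropy.mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Temperature (alpha) update toward the target policy entropy.
    alpha_loss = (log_alpha.exp() * (-log_prob - target_entropy).detach()).mean()
    alpha_opt.zero_grad(); alpha_loss.backward(); alpha_opt.step()
```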
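The evaluation protocol quoted in the Dataset Splits row (roll the agent out for a fixed number of episodes at a fixed interval and average the return) can be restated as the short sketch below. It assumes a classic Gym-style environment with a 4-tuple `step` API; `policy` is any callable mapping an observation to an action.

```python
def evaluate_sketch(env, policy, n_episodes=10):
    """Average undiscounted return over n_episodes rollouts
    (e.g. called every 10000 training timesteps)."""
    returns = []
    for _ in range(n_episodes):
        obs, done, episode_return = env.reset(), False, 0.0
        while not done:
            action = policy(obs)                     # e.g. the actor's mean action
            obs, reward, done, _ = env.step(action)  # classic 4-tuple Gym API
            episode_return += reward
        returns.append(episode_return)
    return sum(returns) / n_episodes
```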
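Finally, the Table 3 values listed in the Experiment Setup row, gathered into a single Python dict for convenience; the key names are ours, the values are quoted from the row above.

```python
# Hyperparameters for the control experiments, restated from Table 3.
CONTROL_HPARAMS = {
    "action_repeat": 1,
    "discount": 0.99,
    "learnable_alpha": True,
    "init_alpha": 0.1,
    "alpha_lr": 1e-4,                    # λα
    "actor_lr": 1e-4,                    # λπ
    "actor_update_freq": 1,
    "critic_arch": [1024, 1024],
    "critic_lr": 1e-4,                   # λQ
    "batch_size": 1024,
    "log_std_min": -5,                   # log σmin
    "log_std_max": 2,                    # log σmax
    "gumbel_net_arch": [1024, 1024, 1024],
    "assignation_entropy_coef": 0.001,   # λassig
    "gumbel_temperature": 1,
}
```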