Piecewise Linear Parametrization of Policies: Towards Interpretable Deep Reinforcement Learning

Authors: Maxime Wabartha, Joelle Pineau

ICLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate HC policies in control and navigation experiments, visualize the improved interpretability of the agent and highlight its trade-off with performance. Moreover, we validate that the restricted model class that the Hyper Combinator belongs to is compatible with the algorithmic constraints of various reinforcement learning algorithms.
Researcher Affiliation Collaboration Maxime Wabartha: McGill University, Mila. Joelle Pineau: McGill University, Mila, FAIR at Meta.
Pseudocode Yes Algorithm 1 SAC (with Hyper Combinator actor) ... Algorithm 2 Update Actor And Alpha
Open Source Code No The paper states 'We base ourselves on an open-source PyTorch implementation of SAC (Yarats & Kostrikov, 2020)' and 'We base our experiments on the open-source code provided by RIS (Chane-Sane et al., 2021)'. These refer to third-party baseline implementations, not an explicit release of the Hyper Combinator code developed in this paper.
Open Datasets Yes We evaluate how well HC policies can control proprioceptive variables such as the joints of a robot through the DeepMind Control Suite benchmark (Tassa et al., 2018).
Dataset Splits No The paper describes evaluation procedures such as 'We evaluate the agent every 10000 timesteps by rolling it out for 10 episodes and taking the average return' and 'Every 10000 steps, we roll out 5 evaluation episodes'. However, it does not provide traditional training/validation/test dataset splits; this omission is common in reinforcement learning research, which relies on environment interaction rather than static datasets.
Hardware Specification Yes All the GPUs were NVIDIA Tesla V100, with 16GB memory available. The CPUs were Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz. Each seed was allocated 1 GPU, 10 CPUs, and 64GB of RAM.
Software Dependencies No The paper mentions software such as 'Python (Van Rossum & Drake Jr, 1995)', 'numpy (Van Der Walt et al., 2011)', 'matplotlib (Hunter, 2007)', and 'PyTorch (Paszke et al., 2017)' in the acknowledgements, but these citations reference the tools' papers rather than the versions used. It does not provide specific version numbers for these software dependencies (e.g., 'PyTorch 1.9' or 'Python 3.8').
Experiment Setup Yes Table 3: Full list of hyperparameters in the control experiments. Includes: Action repeat: 1; Discount factor: 0.99; Learnable α: True; Initial α: 0.1; α learning rate λα: 1e-4; Actor learning rate λπ: 1e-4; Actor update frequency: 1; Critic architecture: [1024, 1024]; Critic learning rate λQ: 1e-4; Batch size: 1024; log σmin: -5; log σmax: 2; Gumbel net architecture: [1024, 1024, 1024]; Sub-policy assignation entropy coefficient λassig: 0.001; Gumbel temperature: 1.
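As a minimal sketch of how the Table 3 values could be organized for a reimplementation, the hyperparameters can be gathered into a single configuration dictionary with basic sanity checks. The key names below are illustrative assumptions, not identifiers from the authors' code; only the numeric values come from the paper.

```python
# Hypothetical configuration assembled from the hyperparameters reported in
# Table 3 (control experiments). Key names are illustrative; values are as
# reported in the paper.
SAC_HC_CONFIG = {
    "action_repeat": 1,
    "discount": 0.99,
    "learnable_alpha": True,
    "init_alpha": 0.1,
    "alpha_lr": 1e-4,
    "actor_lr": 1e-4,
    "actor_update_freq": 1,
    "critic_hidden": [1024, 1024],
    "critic_lr": 1e-4,
    "batch_size": 1024,
    "log_sigma_min": -5,
    "log_sigma_max": 2,
    "gumbel_net_hidden": [1024, 1024, 1024],
    "assignation_entropy_coef": 0.001,
    "gumbel_temperature": 1.0,
}

def validate(cfg):
    """Sanity-check the reported hyperparameter values."""
    assert 0.0 < cfg["discount"] < 1.0, "discount must lie in (0, 1)"
    assert cfg["log_sigma_min"] < cfg["log_sigma_max"], "sigma bounds reversed"
    assert all(cfg[k] > 0 for k in ("alpha_lr", "actor_lr", "critic_lr", "batch_size"))
    return True
```

Centralizing the values this way makes it easy to confirm a reimplementation matches the paper's reported setup before training.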