Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Reinforcement Learning in Presence of Discrete Markovian Context Evolution
Authors: Hang Ren, Aivar Sootla, Taher Jafferjee, Junxiao Shen, Jun Wang, Haitham Bou Ammar
ICLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we demonstrate empirically (using gym environments cart-pole swing-up, drone, intersection) that our approach succeeds where state-of-the-art methods of other frameworks fail and elaborate on the reasons for such failures. In this section, we demonstrate that the HDP offers an effective prior for model learning, while the distillation procedure refines the model and can regulate the context set complexity. |
| Researcher Affiliation | Collaboration | Hang Ren (Huawei UK R&D); Aivar Sootla (Huawei UK R&D); Taher Jafferjee (Huawei UK R&D); Junxiao Shen (Huawei UK R&D, University of Cambridge); Jun Wang (University College London, EMAIL); Haitham Bou-Ammar (Huawei UK R&D and Honorary Lecturer at UCL, EMAIL) |
| Pseudocode | Yes | Algorithm 1: Learning to Control HDP-C-MDP |
| Open Source Code | No | No explicit statement or link providing concrete access to the authors' own source code for the described methodology is found. |
| Open Datasets | Yes | Initial testing on the Cart-Pole Swing-up Task (Lovatto, 2019); the drone environment (Panerati et al., 2021); the highway intersection environment (Leurent, 2018). |
| Dataset Splits | No | The paper describes generating trajectories through interaction with reinforcement learning environments but does not specify fixed training, validation, or test dataset splits in typical percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using software packages such as Pyro, PyTorch, and implementations of PPO and SAC, but it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | All the hyper-parameters are presented in Tables A1, A2 and A3. For model learning experiments we used 500 trajectory roll-outs and 500 epochs for optimization. In the cart-pole environment we used the higher learning rate for the hard failure experiments when Ο < 0 and the lower learning rate for the soft failure experiments when Ο > 0. |