Reinforcement Learning in Presence of Discrete Markovian Context Evolution
Authors: Hang Ren, Aivar Sootla, Taher Jafferjee, Junxiao Shen, Jun Wang, Haitham Bou Ammar
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we demonstrate empirically (using gym environments cart-pole swing-up, drone, intersection) that our approach succeeds where state-of-the-art methods of other frameworks fail and elaborate on the reasons for such failures. In this section, we demonstrate that the HDP offers an effective prior for model learning, while the distillation procedure refines the model and can regulate the context set complexity. |
| Researcher Affiliation | Collaboration | Hang Ren (Huawei UK R&D); Aivar Sootla (Huawei UK R&D); Taher Jafferjee (Huawei UK R&D); Junxiao Shen (Huawei UK R&D, University of Cambridge); Jun Wang (University College London, jun.wang@cs.ucl.ac.uk); Haitham Bou-Ammar (Huawei UK R&D and Honorary Lecturer at UCL, haitham.ammar@huawei.com) |
| Pseudocode | Yes | Algorithm 1: Learning to Control HDP-C-MDP |
| Open Source Code | No | No explicit statement or link providing concrete access to the authors' own source code for the described methodology is found. |
| Open Datasets | Yes | Initial testing on the Cart-Pole Swing-up task (Lovatto, 2019). In the drone environment (Panerati et al., 2021). In the highway intersection environment (Leurent, 2018). |
| Dataset Splits | No | The paper describes generating trajectories through interaction with reinforcement learning environments but does not specify fixed training, validation, or test dataset splits in typical percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using software packages such as Pyro, PyTorch, and implementations of PPO and SAC, but it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | All the hyper-parameters are presented in Tables A1, A2 and A3. For model learning experiments, we used 500 trajectory roll-outs and 500 epochs for optimization. In the cart-pole environment we used the higher learning rate for the hard failure experiments when χ < 0 and the lower learning rate for the soft failure experiments when χ > 0. |
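
As a minimal sketch only, the experiment-setup details quoted above (500 roll-outs, 500 epochs, and a learning rate chosen by the sign of the failure parameter χ) could be captured in a small configuration object like the one below. The class name, field names, and the two learning-rate values are hypothetical placeholders, not the authors' code; the actual values are in the paper's Tables A1, A2 and A3.

```python
# Hypothetical sketch of the reported model-learning setup (not the authors' code).
# Learning-rate values are placeholders; the paper only says a "higher" rate was
# used for hard failures (chi < 0) and a "lower" rate for soft failures (chi > 0).
from dataclasses import dataclass


@dataclass
class ModelLearningConfig:
    num_rollouts: int = 500        # trajectory roll-outs used for model learning
    num_epochs: int = 500          # optimization epochs
    lr_hard_failure: float = 1e-3  # placeholder "higher" learning rate (chi < 0)
    lr_soft_failure: float = 1e-4  # placeholder "lower" learning rate (chi > 0)

    def learning_rate(self, chi: float) -> float:
        """Select the learning rate from the sign of the failure parameter chi."""
        return self.lr_hard_failure if chi < 0 else self.lr_soft_failure


if __name__ == "__main__":
    cfg = ModelLearningConfig()
    print(cfg.learning_rate(-0.5))  # hard-failure regime -> higher placeholder rate
    print(cfg.learning_rate(0.5))   # soft-failure regime -> lower placeholder rate
```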