Reinforcement Learning in Presence of Discrete Markovian Context Evolution

Authors: Hang Ren, Aivar Sootla, Taher Jafferjee, Junxiao Shen, Jun Wang, Haitham Bou Ammar

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we demonstrate empirically (using gym environments cart-pole swing-up, drone, intersection) that our approach succeeds where state-of-the-art methods of other frameworks fail and elaborate on the reasons for such failures." "In this section, we demonstrate that the HDP offers an effective prior for model learning, while the distillation procedure refines the model and can regulate the context set complexity."
Researcher Affiliation | Collaboration | Hang Ren (Huawei UK R&D); Aivar Sootla (Huawei UK R&D); Taher Jafferjee (Huawei UK R&D); Junxiao Shen (Huawei UK R&D, University of Cambridge); Jun Wang (University College London, jun.wang@cs.ucl.ac.uk); Haitham Bou Ammar (Huawei UK R&D, Honorary Lecturer at UCL, haitham.ammar@huawei.com)
Pseudocode | Yes | Algorithm 1: Learning to Control HDP-C-MDP
Open Source Code | No | No explicit statement or link providing concrete access to the authors' own source code for the described methodology is found.
Open Datasets | Yes | "Initial testing on Cart-Pole Swing-up Task (Lovatto, 2019)." "In the drone environment (Panerati et al., 2021)" "In the highway intersection environment (Leurent, 2018)" (A hedged instantiation sketch for these environments appears after the table.)
Dataset Splits | No | The paper describes generating trajectories through interaction with reinforcement learning environments but does not specify fixed training, validation, or test dataset splits as percentages or sample counts.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions using software packages such as Pyro, PyTorch, and implementations of PPO and SAC, but it does not provide specific version numbers for these components. (A version-recording sketch for reproduction appears after the table.)
Experiment Setup | Yes | "All the hyper-parameters are presented in Tables A1, A2 and A3. For model learning experiments we used 500 trajectory roll-outs and 500 epochs for optimization. In the cart-pole environment we used the higher learning rate for hard failure experiments when χ < 0 and used the lower learning rate for the soft failure experiments χ > 0." (A hedged configuration sketch based on these values follows the table.)
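
The three cited environments are available as open-source packages. Below is a minimal, hedged sketch of how they might be instantiated for a reproduction attempt; the package names and environment IDs (gym-cartpole-swingup / CartPoleSwingUp-v0, highway-env / intersection-v0, gym-pybullet-drones / hover-aviary-v0) are assumptions based on the public releases of Lovatto (2019), Leurent (2018), and Panerati et al. (2021), not identifiers confirmed by the paper.

```python
# Hedged sketch: instantiating the cited environments for a reproduction attempt.
# Environment IDs are assumptions based on the public releases of the cited
# packages; the paper itself does not list exact IDs or versions.
import gym

# Cart-pole swing-up (Lovatto, 2019): pip install gym-cartpole-swingup
import gym_cartpole_swingup  # noqa: F401  (registers CartPoleSwingUp-v0)
swingup_env = gym.make("CartPoleSwingUp-v0")

# Highway intersection (Leurent, 2018): pip install highway-env
import highway_env  # noqa: F401  (registers intersection-v0)
intersection_env = gym.make("intersection-v0")

# Drone control (Panerati et al., 2021): pip install gym-pybullet-drones
# IDs vary across releases; "hover-aviary-v0" is an assumed example.
drone_env = gym.make("hover-aviary-v0")

# One-step sanity check with the classic gym API (reset -> step -> close).
for env in (swingup_env, intersection_env, drone_env):
    obs = env.reset()
    obs, reward, done, info = env.step(env.action_space.sample())
    env.close()
```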
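
Because the paper names Pyro and PyTorch without version numbers, a reproduction attempt would need to record the versions it actually ran with. A minimal sketch follows; the package list is an assumption drawn from the dependencies named in the table above.

```python
# Minimal sketch: log the installed versions of the packages named in the paper
# (Pyro, PyTorch) plus the assumed environment packages, since the paper does
# not pin any versions.
from importlib.metadata import version, PackageNotFoundError

packages = ["torch", "pyro-ppl", "gym", "highway-env", "gym-pybullet-drones"]
for name in packages:
    try:
        print(f"{name}=={version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```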
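
The experiment-setup excerpt fixes only a few quantities (500 trajectory roll-outs, 500 optimization epochs, and separate learning rates for the hard-failure χ < 0 and soft-failure χ > 0 cart-pole variants), leaving the rest to Tables A1-A3. The sketch below collects those stated values into a configuration object; the learning-rate values themselves are placeholders, not numbers reported in the paper.

```python
# Hedged configuration sketch for the model-learning experiments.
# Roll-out and epoch counts come from the paper; the two learning rates are
# placeholders, since the paper only states that a higher rate was used for
# hard-failure (chi < 0) cart-pole runs and a lower one for soft failures (chi > 0).
from dataclasses import dataclass


@dataclass
class ModelLearningConfig:
    num_rollouts: int = 500        # trajectory roll-outs used for model learning
    num_epochs: int = 500          # optimization epochs
    lr_hard_failure: float = 1e-3  # placeholder "higher" learning rate (chi < 0)
    lr_soft_failure: float = 1e-4  # placeholder "lower" learning rate (chi > 0)

    def learning_rate(self, chi: float) -> float:
        """Select the learning rate from the sign of the failure parameter chi."""
        return self.lr_hard_failure if chi < 0 else self.lr_soft_failure


config = ModelLearningConfig()
print(config.learning_rate(chi=-0.5))  # hard-failure setting -> higher rate
```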