Reinforcement Learning in Presence of Discrete Markovian Context Evolution

Authors: Hang Ren, Aivar Sootla, Taher Jafferjee, Junxiao Shen, Jun Wang, Haitham Bou Ammar

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we demonstrate empirically (using gym environments cart-pole swing-up, drone, intersection) that our approach succeeds where state-of-the-art methods of other frameworks fail and elaborate on the reasons for such failures." "In this section, we demonstrate that the HDP offers an effective prior for model learning, while the distillation procedure refines the model and can regulate the context set complexity."
Researcher Affiliation | Collaboration | Hang Ren (Huawei UK R&D); Aivar Sootla (Huawei UK R&D); Taher Jafferjee (Huawei UK R&D); Junxiao Shen (Huawei UK R&D, University of Cambridge); Jun Wang (University College London, jun.wang@cs.ucl.ac.uk); Haitham Bou Ammar (Huawei UK R&D, Honorary Lecturer at UCL, haitham.ammar@huawei.com)
Pseudocode | Yes | Algorithm 1: Learning to Control HDP-C-MDP
Open Source Code | No | No explicit statement or link providing concrete access to the authors' own source code for the described methodology is found.
Open Datasets | Yes | "Initial testing on Cart-Pole Swing-up Task (Lovatto, 2019)." "In the drone environment (Panerati et al., 2021)" "In the highway intersection environment (Leurent, 2018)" (A hedged instantiation sketch for these environments appears after the table.)
Dataset Splits | No | The paper describes generating trajectories through interaction with reinforcement learning environments but does not specify fixed training, validation, or test dataset splits as percentages or sample counts.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions using software packages such as Pyro, PyTorch, and implementations of PPO and SAC, but it does not provide specific version numbers for these components. (A version-recording sketch for reproduction appears after the table.)
Experiment Setup | Yes | "All the hyper-parameters are presented in Tables A1, A2 and A3. For model learning experiments we used 500 trajectory roll-outs and 500 epochs for optimization. In the cart-pole environment we used the higher learning rate for hard failure experiments when χ < 0 and used the lower learning rate for the soft failure experiments χ > 0." (A hedged configuration sketch based on these values follows the table.)
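
The three cited environments are available as open-source packages. Below is a minimal, hedged sketch of how they might be instantiated for a reproduction attempt; the package names and environment IDs (gym-cartpole-swingup / CartPoleSwingUp-v0, highway-env / intersection-v0, gym-pybullet-drones / hover-aviary-v0) are assumptions based on the public releases of Lovatto (2019), Leurent (2018), and Panerati et al. (2021), not identifiers confirmed by the paper.

```python
# Hedged sketch: instantiating the cited environments for a reproduction attempt.
# Environment IDs are assumptions based on the public releases of the cited
# packages; the paper itself does not list exact IDs or versions.
import gym

# Cart-pole swing-up (Lovatto, 2019): pip install gym-cartpole-swingup
import gym_cartpole_swingup  # noqa: F401  (registers CartPoleSwingUp-v0)
swingup_env = gym.make("CartPoleSwingUp-v0")

# Highway intersection (Leurent, 2018): pip install highway-env
import highway_env  # noqa: F401  (registers intersection-v0)
intersection_env = gym.make("intersection-v0")

# Drone control (Panerati et al., 2021): pip install gym-pybullet-drones
# IDs vary across releases; "hover-aviary-v0" is an assumed example.
drone_env = gym.make("hover-aviary-v0")

# One-step sanity check with the classic gym API (reset -> step -> close).
for env in (swingup_env, intersection_env, drone_env):
    obs = env.reset()
    obs, reward, done, info = env.step(env.action_space.sample())
    env.close()
```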
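
Because the paper names Pyro and PyTorch without version numbers, a reproduction attempt would need to record the versions it actually ran with. A minimal sketch follows; the package list is an assumption drawn from the dependencies named in the table above.

```python
# Minimal sketch: log the installed versions of the packages named in the paper
# (Pyro, PyTorch) plus the assumed environment packages, since the paper does
# not pin any versions.
from importlib.metadata import version, PackageNotFoundError

packages = ["torch", "pyro-ppl", "gym", "highway-env", "gym-pybullet-drones"]
for name in packages:
    try:
        print(f"{name}=={version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```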
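
The experiment-setup excerpt fixes only a few quantities (500 trajectory roll-outs, 500 optimization epochs, and separate learning rates for the hard-failure χ < 0 and soft-failure χ > 0 cart-pole variants), leaving the rest to Tables A1-A3. The sketch below collects those stated values into a configuration object; the learning-rate values themselves are placeholders, not numbers reported in the paper.

```python
# Hedged configuration sketch for the model-learning experiments.
# Roll-out and epoch counts come from the paper; the two learning rates are
# placeholders, since the paper only states that a higher rate was used for
# hard-failure (chi < 0) cart-pole runs and a lower one for soft failures (chi > 0).
from dataclasses import dataclass


@dataclass
class ModelLearningConfig:
    num_rollouts: int = 500        # trajectory roll-outs used for model learning
    num_epochs: int = 500          # optimization epochs
    lr_hard_failure: float = 1e-3  # placeholder "higher" learning rate (chi < 0)
    lr_soft_failure: float = 1e-4  # placeholder "lower" learning rate (chi > 0)

    def learning_rate(self, chi: float) -> float:
        """Select the learning rate from the sign of the failure parameter chi."""
        return self.lr_hard_failure if chi < 0 else self.lr_soft_failure


config = ModelLearningConfig()
print(config.learning_rate(chi=-0.5))  # hard-failure setting -> higher rate
```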