Dr Jekyll & Mr Hyde: the strange case of off-policy policy updates

Authors: Romain Laroche, Rémi Tachet des Combes

NeurIPS 2021

Reproducibility assessment (Variable: Result — LLM response)
Research Type: Experimental — "We extensively test on finite MDPs where J&H demonstrates a superior ability to recover from converging to a suboptimal policy without impairing its speed of convergence. We also implement a deep version of the algorithm and test it on a simple problem where it shows promising results."
Researcher Affiliation: Industry — Romain Laroche, Microsoft Research Montréal, Canada; Rémi Tachet des Combes, Microsoft Research Montréal, Canada
Pseudocode: Yes — "Algorithm 1: Dr Jekyll & Mr Hyde algorithm. After initialization of parameters and buffers, we enter the main loop."
Open Source Code: Yes — "All code available at http://aka.ms/jnh."
Open Datasets: Yes — "We train J&H on a version of the Four Rooms environment [44], a 15x15 grid split into four rooms (see App. F for the exact layout)."
Dataset Splits: No — The paper describes how data is collected and used for updates (e.g., replay buffers, on-policy/off-policy learning), but does not specify fixed train/validation/test splits with percentages or counts, as is typical for static datasets in supervised learning. The environments used (Chain Domain, Random MDPs, Four Rooms) are simulated, so data is generated dynamically rather than drawn from predefined splits of a static dataset.
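The dynamic data-collection pattern described above, where transitions accumulate in a replay buffer during interaction instead of coming from fixed splits, can be sketched as follows. This is a generic illustration, not the paper's implementation; the toy chain environment, buffer capacity, and random behavior policy are all assumptions made for the example.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of transitions collected during interaction.

    Unlike a static supervised dataset, its contents change as the
    behavior policy explores, so there are no fixed train/val/test splits.
    """
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Off-policy updates draw a random minibatch of past transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Toy interaction loop on a hypothetical 10-state chain environment.
buf = ReplayBuffer(capacity=100)
state = 0
for t in range(50):
    action = random.choice([-1, 1])            # random behavior policy
    next_state = max(0, min(9, state + action))
    reward = 1.0 if next_state == 9 else 0.0   # reward only at the far end
    buf.push(state, action, reward, next_state, next_state == 9)
    state = 0 if next_state == 9 else next_state  # reset on reaching the goal

batch = buf.sample(8)  # minibatch for an off-policy update
```

Because the buffer is refilled as the policy changes, the "dataset" is a moving target, which is why fixed split percentages do not apply.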
Hardware Specification: No — "All experiments were run on a single machine with a 4-core processor and 32GB of RAM." (This description lacks the exact CPU model and any GPU information.)
Software Dependencies: Yes — "The code for this paper is written in Python 3.8 and uses PyTorch 1.7.1, NumPy 1.20.1, and Matplotlib 3.3.4."
Experiment Setup: Yes — "We test performance against time, learning rate η of the actor, MDP parameters |S| and β, off-policiness o_t, and policy entropy regularization weight λ, with both direct and softmax parametrizations, on the chain and random MDPs. The full report is available in App. D."
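A sweep over the quantities named in the quoted setup (η, |S|, β, off-policiness, λ, parametrization) can be sketched as a Cartesian product over value grids. The grid values below are illustrative placeholders only; the actual values used in the paper's App. D experiments are not reproduced here.

```python
from itertools import product

# Illustrative value grids for the swept quantities; the paper's
# actual grids (App. D) may differ.
grid = {
    "eta": [0.01, 0.1, 0.5],               # actor learning rate η
    "num_states": [10, 20],                # MDP size |S|
    "beta": [0.5, 1.0],                    # MDP parameter β
    "off_policiness": [0.0, 0.5, 1.0],     # o_t
    "entropy_weight": [0.0, 0.01],         # regularization weight λ
    "parametrization": ["direct", "softmax"],
}

def configurations(grid):
    """Yield one config dict per point of the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(configurations(grid))
# 3 * 2 * 2 * 3 * 2 * 2 = 144 configurations in this illustrative grid
```

Enumerating configurations up front like this makes each run reproducible from its config dict alone, which is the property the "Experiment Setup" variable assesses.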