Dr Jekyll & Mr Hyde: the strange case of off-policy policy updates

Authors: Romain Laroche, Rémi Tachet des Combes

NeurIPS 2021

Reproducibility assessment (Variable: Result — LLM response)
Research Type: Experimental — "We extensively test on finite MDPs where J&H demonstrates a superior ability to recover from converging to a suboptimal policy without impairing its speed of convergence. We also implement a deep version of the algorithm and test it on a simple problem where it shows promising results."
Researcher Affiliation: Industry — Romain Laroche, Microsoft Research Montréal, Canada; Rémi Tachet des Combes, Microsoft Research Montréal, Canada
Pseudocode: Yes — "Algorithm 1: Dr Jekyll & Mr Hyde algorithm. After initialization of parameters and buffers, we enter the main loop."
Open Source Code: Yes — "All code available at http://aka.ms/jnh."
Open Datasets: Yes — "We train J&H on a version of the Four Rooms environment [44], a 15x15 grid split into four rooms (see App. F for the exact layout)."
Dataset Splits: No — The paper describes how data is collected and used for updates (e.g., replay buffers, on-policy/off-policy learning), but does not specify fixed train/validation/test splits with percentages or counts, as is typical for static datasets in supervised learning. The environments used (Chain Domain, Random MDPs, Four Rooms) are simulated, so data is generated dynamically rather than drawn from predefined splits of a static dataset.
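The dynamic data-collection pattern described above, where transitions accumulate in a replay buffer during interaction instead of coming from fixed splits, can be sketched as follows. This is a generic illustration, not the paper's implementation; the toy chain environment, buffer capacity, and random behavior policy are all assumptions made for the example.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of transitions collected during interaction.

    Unlike a static supervised dataset, its contents change as the
    behavior policy explores, so there are no fixed train/val/test splits.
    """
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Off-policy updates draw a random minibatch of past transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Toy interaction loop on a hypothetical 10-state chain environment.
buf = ReplayBuffer(capacity=100)
state = 0
for t in range(50):
    action = random.choice([-1, 1])            # random behavior policy
    next_state = max(0, min(9, state + action))
    reward = 1.0 if next_state == 9 else 0.0   # reward only at the far end
    buf.push(state, action, reward, next_state, next_state == 9)
    state = 0 if next_state == 9 else next_state  # reset on reaching the goal

batch = buf.sample(8)  # minibatch for an off-policy update
```

Because the buffer is refilled as the policy changes, the "dataset" is a moving target, which is why fixed split percentages do not apply.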
Hardware Specification: No — "All experiments were run on a single machine with a 4-core processor and 32GB of RAM." (This description lacks the exact CPU model and any GPU information.)
Software Dependencies: Yes — "The code for this paper is written in Python 3.8 and uses PyTorch 1.7.1, NumPy 1.20.1, and Matplotlib 3.3.4."
Experiment Setup: Yes — "We test performance against time, learning rate η of the actor, MDP parameters |S| and β, off-policiness o_t, and policy entropy regularization weight λ, with both direct and softmax parametrizations, on the chain and random MDPs. The full report is available in App. D."
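A sweep over the quantities named in the quoted setup (η, |S|, β, off-policiness, λ, parametrization) can be sketched as a Cartesian product over value grids. The grid values below are illustrative placeholders only; the actual values used in the paper's App. D experiments are not reproduced here.

```python
from itertools import product

# Illustrative value grids for the swept quantities; the paper's
# actual grids (App. D) may differ.
grid = {
    "eta": [0.01, 0.1, 0.5],               # actor learning rate η
    "num_states": [10, 20],                # MDP size |S|
    "beta": [0.5, 1.0],                    # MDP parameter β
    "off_policiness": [0.0, 0.5, 1.0],     # o_t
    "entropy_weight": [0.0, 0.01],         # regularization weight λ
    "parametrization": ["direct", "softmax"],
}

def configurations(grid):
    """Yield one config dict per point of the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(configurations(grid))
# 3 * 2 * 2 * 3 * 2 * 2 = 144 configurations in this illustrative grid
```

Enumerating configurations up front like this makes each run reproducible from its config dict alone, which is the property the "Experiment Setup" variable assesses.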