Dr Jekyll & Mr Hyde: the strange case of off-policy policy updates
Authors: Romain Laroche, Rémi Tachet des Combes
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively test on finite MDPs where J&H demonstrates a superior ability to recover from converging to a suboptimal policy without impairing its speed of convergence. We also implement a deep version of the algorithm and test it on a simple problem where it shows promising results. |
| Researcher Affiliation | Industry | Romain Laroche, Microsoft Research, Montréal, Canada; Rémi Tachet des Combes, Microsoft Research, Montréal, Canada |
| Pseudocode | Yes | Algorithm 1: Dr Jekyll & Mr Hyde algorithm. After initialization of parameters and buffers, we enter the main loop. (An illustrative sketch of such a loop is given after this table.) |
| Open Source Code | Yes | All code available at http://aka.ms/jnh. |
| Open Datasets | Yes | We train J&H on a version of the Four Rooms environment [44], a 15×15 grid split into four rooms (see App. F for the exact layout; an illustrative grid-construction sketch follows the table). |
| Dataset Splits | No | The paper describes how data is collected and used for updates (e.g., replay buffers, on-policy/off-policy data), but does not specify fixed train/validation/test splits with percentages or counts of the kind typical for static supervised-learning datasets. The environments used (Chain Domain, Random MDPs, Four Rooms) are simulated, so data is generated dynamically rather than drawn from predefined splits of a static dataset. |
| Hardware Specification | No | All experiments were run on a single machine with a 4-core processor and 32GB of RAM. (This description is not specific enough for reproducibility, as it lacks the exact CPU model and any GPU information.) |
| Software Dependencies | Yes | The code for this paper is written in Python 3.8 and uses PyTorch 1.7.1, NumPy 1.20.1, and Matplotlib 3.3.4. (A pinned-requirements sketch follows the table.) |
| Experiment Setup | Yes | We test performance against time, learning rate η of the actor, MDP parameters \|S\| and β, off-policiness o_t, and policy entropy regularization weight λ, with both direct and softmax parametrizations, on the chain and random MDPs. The full report is available in App. D. (A softmax-parametrization sketch follows the table.) |
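
The Pseudocode row refers to the paper's Algorithm 1. The sketch below is only an illustration of the two-policy control loop described there, under the assumption that each episode is handed to the purely exploratory policy (Mr Hyde) with probability o_t and to the exploitative actor (Dr Jekyll) otherwise, and that every transition is stored in a replay buffer for off-policy updates; the `act`/`step` interfaces and the return value are hypothetical, and the actual update rules are those of Algorithm 1.

```python
import random

def jekyll_and_hyde_episode(env, jekyll, hyde, replay_buffer, o_t):
    """Illustrative sketch of one Dr Jekyll & Mr Hyde episode
    (not the authors' exact Algorithm 1).

    With probability o_t the exploratory policy Mr Hyde controls the
    episode; otherwise the exploitative actor Dr Jekyll does.  Every
    transition is appended to the replay buffer so the critic can be
    trained off-policy afterwards.
    """
    behaviour = hyde if random.random() < o_t else jekyll
    state, done = env.reset(), False
    trajectory = []
    while not done:
        action = behaviour.act(state)                   # hypothetical policy interface
        next_state, reward, done = env.step(action)     # hypothetical environment interface
        transition = (state, action, reward, next_state, done)
        replay_buffer.append(transition)                # data for off-policy updates
        trajectory.append(transition)
        state = next_state
    return behaviour is jekyll, trajectory              # who acted, and the episode data
```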
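The Open Datasets row describes the Four Rooms environment as a 15×15 grid split into four rooms, with the exact layout given in App. F of the paper. The following is a minimal NumPy sketch of how such a grid could be generated; the doorway positions are placeholders, not the layout from App. F.

```python
import numpy as np

def make_four_rooms(size=15):
    """Illustrative Four Rooms grid of shape (size, size): 1 = wall, 0 = free.

    The paper's exact layout is in its App. F; the doorway positions
    below are assumptions made purely for illustration.
    """
    grid = np.zeros((size, size), dtype=np.int8)
    grid[0, :] = grid[-1, :] = 1          # top and bottom outer walls
    grid[:, 0] = grid[:, -1] = 1          # left and right outer walls
    mid = size // 2
    grid[mid, :] = 1                      # horizontal dividing wall
    grid[:, mid] = 1                      # vertical dividing wall
    for r, c in [(mid, 3), (mid, 11), (3, mid), (11, mid)]:
        grid[r, c] = 0                    # one doorway per internal wall (assumed positions)
    return grid

if __name__ == "__main__":
    print(make_four_rooms())
```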
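The Software Dependencies row pins exact versions, which can be copied into a requirements file; the file name and install command below are the usual Python convention, not something the paper prescribes.

```
# requirements.txt — versions as reported by the paper (Python 3.8)
torch==1.7.1
numpy==1.20.1
matplotlib==3.3.4
```

Installing these into a fresh Python 3.8 environment with `pip install -r requirements.txt` should reproduce the reported software stack.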
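The Experiment Setup row sweeps the actor learning rate η and the entropy regularization weight λ under direct and softmax parametrizations. The snippet below sketches what a softmax (tabular) actor step with an entropy bonus could look like in PyTorch 1.7; the loss form, variable names, and sizes are assumptions for illustration, not the paper's exact update.

```python
import torch

def softmax_actor_update(logits, optimizer, state, action, advantage, lam):
    """Illustrative softmax-parametrized actor step with an entropy bonus.

    logits is a (num_states, num_actions) learnable tensor; the loss
    -(log pi(a|s) * advantage + lam * H(pi(.|s))) is a common form and
    stands in for the paper's exact update rule.
    """
    log_probs = torch.log_softmax(logits[state], dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum()
    loss = -(log_probs[action] * advantage + lam * entropy)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage sketch with hypothetical sizes (e.g. a small grid world):
num_states, num_actions = 25, 4
logits = torch.zeros(num_states, num_actions, requires_grad=True)
opt = torch.optim.SGD([logits], lr=0.1)   # lr plays the role of the actor learning rate eta
softmax_actor_update(logits, opt, state=3, action=1, advantage=0.5, lam=0.01)
```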