Reinforcement Learning with Random Delays
Authors: Yann Bouteiller, Simon Ramstedt, Giovanni Beltrame, Christopher Pal, Jonathan Binas
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This is shown theoretically and also demonstrated practically on a delay-augmented version of the MuJoCo continuous control benchmark. |
| Researcher Affiliation | Academia | Yann Bouteiller (Polytechnique Montreal, yann.bouteiller@polymtl.ca); Simon Ramstedt (Mila, McGill University, simonramstedt@gmail.com); Giovanni Beltrame (Polytechnique Montreal); Christopher Pal (Mila, Polytechnique Montreal); Jonathan Binas (Mila, University of Montreal) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Along with this work we release our code, including a wrapper that conveniently augments any OpenAI Gym environment with custom delays. (A conceptual sketch of such a wrapper appears after the table.) |
| Open Datasets | Yes | In particular, this enables us to introduce random delays to the Gym MuJoCo continuous control suite (Brockman et al., 2016; Todorov et al.), which is otherwise turn-based. |
| Dataset Splits | No | The paper uses reinforcement learning environments (MuJoCo) and does not describe explicit train/validation/test dataset splits with percentages or sample counts; such splits are not typically applicable in this setting. |
| Hardware Specification | No | The paper thanks Element AI and Compute Canada for providing computational resources but does not specify any exact hardware details such as GPU models, CPU models, or memory. |
| Software Dependencies | No | The paper mentions using PyTorch for initialization and refers to the Adam optimizer, but it does not specify exact version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | Table 1: Hyperparameters lists specific values for Optimizer (Adam), Learning rate (0.0003), Discount factor (γ) (0.99), Batch size (128), Target weights update coefficient (τ) (0.005), Gradient steps / environment steps (1), Reward scale (5.0), Entropy scale (1.0), Replay memory size (1000000), Number of samples before training starts (10000), and Number of critics (2). |
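
To illustrate the kind of delay wrapper referenced in the Open Source Code row, a minimal Python sketch is given below. The class name, constructor arguments, and delay-sampling scheme are illustrative assumptions (classic Gym step/reset API, continuous Box action spaces); this is not the authors' released implementation.

```python
import random
from collections import deque

import gym
import numpy as np


class RandomDelayWrapper(gym.Wrapper):
    """Illustrative random-delay wrapper for a Gym environment.

    Hypothetical sketch only; the authors' released wrapper may differ
    in interface and behavior.
    """

    def __init__(self, env, max_obs_delay=2, max_act_delay=2):
        super().__init__(env)
        self.max_obs_delay = max_obs_delay
        self.max_act_delay = max_act_delay
        self.obs_buffer = deque(maxlen=max_obs_delay + 1)
        self.act_buffer = deque(maxlen=max_act_delay + 1)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        # Pre-fill the buffers so delayed values are always available.
        self.obs_buffer.extend([obs] * (self.max_obs_delay + 1))
        no_op = np.zeros(self.env.action_space.shape,
                         dtype=self.env.action_space.dtype)
        self.act_buffer.extend([no_op] * (self.max_act_delay + 1))
        return obs

    def step(self, action):
        # Queue the freshest action and apply a randomly older one,
        # emulating a random action (communication) delay.
        self.act_buffer.append(action)
        act_delay = random.randint(0, self.max_act_delay)
        applied_action = self.act_buffer[-(act_delay + 1)]

        obs, reward, done, info = self.env.step(applied_action)

        # Queue the freshest observation and return a randomly older one,
        # emulating a random observation delay.
        self.obs_buffer.append(obs)
        obs_delay = random.randint(0, self.max_obs_delay)
        delayed_obs = self.obs_buffer[-(obs_delay + 1)]
        info = dict(info, obs_delay=obs_delay, act_delay=act_delay)
        return delayed_obs, reward, done, info


# Usage: wrap any Gym MuJoCo task, e.g.
# env = RandomDelayWrapper(gym.make("HalfCheetah-v2"))
```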
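
The hyperparameter values from Table 1 can likewise be collected into a small configuration sketch. Only the numeric values come from the paper; the dictionary keys and the placeholder network are hypothetical.

```python
import torch

# Hyperparameter values as reported in Table 1 of the paper; key names
# and the placeholder network below are illustrative only.
HYPERPARAMS = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "discount_factor": 0.99,             # gamma
    "batch_size": 128,
    "target_update_coefficient": 0.005,  # tau
    "gradient_steps_per_env_step": 1,
    "reward_scale": 5.0,
    "entropy_scale": 1.0,
    "replay_memory_size": 1_000_000,
    "samples_before_training": 10_000,
    "num_critics": 2,
}

# Instantiating the reported optimizer for an arbitrary placeholder model.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=HYPERPARAMS["learning_rate"])
```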