Reinforcement Learning with Random Delays

Authors: Yann Bouteiller, Simon Ramstedt, Giovanni Beltrame, Christopher Pal, Jonathan Binas

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This is shown theoretically and also demonstrated practically on a delay-augmented version of the MuJoCo continuous control benchmark.
Researcher Affiliation | Academia | Yann Bouteiller (Polytechnique Montreal, yann.bouteiller@polymtl.ca); Simon Ramstedt (Mila, McGill University, simonramstedt@gmail.com); Giovanni Beltrame (Polytechnique Montreal); Christopher Pal (Mila, Polytechnique Montreal); Jonathan Binas (Mila, University of Montreal)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Along with this work we release our code, including a wrapper that conveniently augments any OpenAI Gym environment with custom delays. (A hedged sketch of such a delay wrapper appears after the table.)
Open Datasets | Yes | In particular, this enables us to introduce random delays to the Gym MuJoCo continuous control suite (Brockman et al., 2016; Todorov et al.), which is otherwise turn-based.
Dataset Splits | No | The paper uses reinforcement learning environments (MuJoCo) and does not describe explicit train/validation/test dataset splits with percentages or sample counts; such splits are not typically applicable in this setting.
Hardware Specification | No | The paper thanks Element AI and Compute Canada for providing computational resources but does not specify any exact hardware details such as GPU models, CPU models, or memory.
Software Dependencies | No | The paper mentions using PyTorch (for initialization) and refers to the Adam optimizer, but it does not specify exact version numbers for any software libraries or dependencies.
Experiment Setup | Yes | Table 1 (Hyperparameters) lists specific values: Optimizer: Adam; Learning rate: 0.0003; Discount factor (γ): 0.99; Batch size: 128; Target weights update coefficient (τ): 0.005; Gradient steps / environment steps: 1; Reward scale: 5.0; Entropy scale: 1.0; Replay memory size: 1,000,000; Number of samples before training starts: 10,000; Number of critics: 2. (These values are collected into a configuration sketch below.)
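
The "Open Source Code" row mentions a wrapper that augments any OpenAI Gym environment with custom delays. The snippet below is only a minimal sketch of how a random action-delay wrapper could look, not the authors' released implementation: the class name `RandomActionDelayWrapper`, its `min_delay`/`max_delay` parameters, and the per-step delay sampling are all hypothetical simplifications of the paper's delayed-environment setting.

```python
import random
from collections import deque

import gym


class RandomActionDelayWrapper(gym.Wrapper):
    """Sketch: at each step, apply an action chosen a random number of
    steps in the past (hypothetical, not the released wrapper)."""

    def __init__(self, env, min_delay=0, max_delay=2):
        super().__init__(env)
        self.min_delay = min_delay  # smallest action delay, in steps
        self.max_delay = max_delay  # largest action delay, in steps
        self._recent_actions = deque(maxlen=max_delay + 1)  # newest action last

    def reset(self, **kwargs):
        # Pad the action history with zero ("no-op") actions so that a
        # delayed lookup is well defined from the very first step.
        obs = self.env.reset(**kwargs)
        noop = self.env.action_space.sample() * 0
        self._recent_actions.extend([noop] * (self.max_delay + 1))
        return obs

    def step(self, action):
        # Record the newly chosen action, then forward to the underlying
        # environment an action that was chosen between `min_delay` and
        # `max_delay` steps ago.
        self._recent_actions.append(action)
        delay = random.randint(self.min_delay, self.max_delay)
        delayed_action = self._recent_actions[-1 - delay]
        return self.env.step(delayed_action)
```

Usage would follow the standard Gym wrapper pattern, e.g. `env = RandomActionDelayWrapper(gym.make("HalfCheetah-v2"), min_delay=0, max_delay=2)` (environment name chosen only for illustration).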
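
The hyperparameter values quoted in the "Experiment Setup" row come from the paper's Table 1. Purely for readability, they are gathered below into a plain Python dictionary; the key names are hypothetical and are not taken from the authors' released code, only the values are as reported.

```python
# Hyperparameters reported in Table 1 of the paper, collected into a dict.
# Key names are hypothetical; values are as quoted above.
HYPERPARAMETERS = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,                # 0.0003
    "discount_factor": 0.99,              # gamma
    "batch_size": 128,
    "target_update_coefficient": 0.005,   # tau, soft target-network updates
    "gradient_steps_per_env_step": 1,
    "reward_scale": 5.0,
    "entropy_scale": 1.0,
    "replay_memory_size": 1_000_000,
    "samples_before_training": 10_000,
    "num_critics": 2,
}
```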