Offline Reinforcement Learning as Anti-exploration

Authors: Shideh Rezaeifar, Robert Dadashi, Nino Vieillard, Léonard Hussenot, Olivier Bachem, Olivier Pietquin, Matthieu Geist (pp. 8106-8114)

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the agent on the hand manipulation and locomotion tasks of the D4RL benchmark (Fu et al. 2020), and show that it is competitive with the state of the art.
Researcher Affiliation | Collaboration | Shideh Rezaeifar (1); affiliations: (1) University of Geneva, (2) Google Research, Brain Team, (3) Université de Lorraine, CNRS, Inria, IECL, F-54000 Nancy, France, (4) Université de Lille, CNRS, Inria, UMR 9189 CRIStAL
Pseudocode | Yes | Algorithm 1: CVAE training, and Algorithm 2: Modified TD3 training (a hedged sketch of the penalized-reward idea follows the table).
Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the methodology or a link to a code repository.
Open Datasets | Yes | We evaluate the agent on the hand manipulation and locomotion tasks of the D4RL benchmark (Fu et al. 2020) (a loading sketch follows the table).
Dataset Splits | No | The paper describes the D4RL datasets used but does not explicitly provide specific training, validation, or test dataset splits (e.g., percentages or sample counts) used in their experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions software components like TD3 and the Adam optimizer, but does not provide specific version numbers for these or other key software libraries and dependencies.
Experiment Setup | Yes | The architecture of the TD3 actor and critic consists of a network with two hidden layers of size 256; the first layer has a tanh activation and the second an elu activation. The actor outputs actions with a tanh activation, scaled by the action boundaries of each environment. Apart from the activation functions, we use the default parameters of TD3 from the authors' implementation, and run 10^6 gradient steps using the Adam optimizer with batches of size 256 (an architecture sketch follows the table).
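
The Pseudocode row points to Algorithm 1 (CVAE training) and Algorithm 2 (modified TD3). The sketch below illustrates only the core idea the paper states, subtracting a prediction-based bonus from the reward; it assumes PyTorch, an illustrative `cvae(state, action)` module returning a reconstructed action, and an `alpha` coefficient. None of this is the authors' released code (no code is linked), and the exact placement of the penalty inside TD3 should be taken from Algorithm 2.

```python
# Hedged sketch of the anti-exploration penalty (assumption: PyTorch;
# `cvae` and `alpha` are illustrative, not the authors' code).
import torch


def anti_exploration_bonus(cvae, state, action):
    """Novelty bonus: per-sample action reconstruction error of the CVAE."""
    recon_action, _mu, _logvar = cvae(state, action)      # assumed CVAE interface
    return ((recon_action - action) ** 2).mean(dim=-1)    # large for OOD actions


def penalized_reward(cvae, state, action, reward, alpha=1.0):
    """Subtract the bonus from the dataset reward instead of adding it."""
    with torch.no_grad():
        bonus = anti_exploration_bonus(cvae, state, action)
    return reward - alpha * bonus
```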
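The D4RL datasets named in the Open Datasets row are publicly available through the d4rl package. The snippet below is a loading sketch, not the authors' data pipeline; the environment id is one example among the locomotion and hand-manipulation tasks.

```python
# Hedged sketch: loading a D4RL dataset with the public d4rl package.
import gym
import d4rl  # registers the D4RL environments with gym

env = gym.make("hopper-medium-v0")   # example locomotion task
data = d4rl.qlearning_dataset(env)   # observations, actions, rewards,
                                     # next_observations, terminals
print({k: v.shape for k, v in data.items()})
```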
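The Experiment Setup row fully specifies the actor and critic MLPs. Below is a sketch under the assumption of PyTorch (the paper does not name a framework); class names are hypothetical, and applying the same tanh/elu pattern to the critic is an assumption since the row describes both networks jointly.

```python
# Hedged sketch of the described actor/critic MLPs (assumption: PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Two hidden layers of 256: tanh then elu; tanh output scaled to action bounds."""

    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.l1 = nn.Linear(state_dim, 256)
        self.l2 = nn.Linear(256, 256)
        self.l3 = nn.Linear(256, action_dim)
        self.max_action = max_action

    def forward(self, state):
        h = torch.tanh(self.l1(state))
        h = F.elu(self.l2(h))
        return self.max_action * torch.tanh(self.l3(h))


class Critic(nn.Module):
    """Same hidden structure; TD3 keeps two copies of this Q-network."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 256)
        self.l2 = nn.Linear(256, 256)
        self.l3 = nn.Linear(256, 1)

    def forward(self, state, action):
        h = torch.tanh(self.l1(torch.cat([state, action], dim=-1)))
        h = F.elu(self.l2(h))
        return self.l3(h)
```

Per the row above, training would then run 10^6 gradient steps of the Adam optimizer with batches of size 256, keeping the remaining TD3 defaults.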