Offline Reinforcement Learning as Anti-exploration

Authors: Shideh Rezaeifar, Robert Dadashi, Nino Vieillard, Léonard Hussenot, Olivier Bachem, Olivier Pietquin, Matthieu Geist (pp. 8106-8114)

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the agent on the hand manipulation and locomotion tasks of the D4RL benchmark (Fu et al. 2020), and show that it is competitive with the state of the art.
Researcher Affiliation | Collaboration | Shideh Rezaeifar (1); affiliations: (1) University of Geneva, (2) Google Research, Brain Team, (3) Université de Lorraine, CNRS, Inria, IECL, F-54000 Nancy, France, (4) Université de Lille, CNRS, Inria, UMR 9189 CRIStAL
Pseudocode | Yes | Algorithm 1: CVAE training, and Algorithm 2: Modified TD3 training (a hedged sketch of the penalized-reward idea follows the table).
Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the methodology or a link to a code repository.
Open Datasets | Yes | We evaluate the agent on the hand manipulation and locomotion tasks of the D4RL benchmark (Fu et al. 2020) (a loading sketch follows the table).
Dataset Splits | No | The paper describes the D4RL datasets used but does not explicitly provide specific training, validation, or test dataset splits (e.g., percentages or sample counts) used in their experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions software components like TD3 and the Adam optimizer, but does not provide specific version numbers for these or other key software libraries and dependencies.
Experiment Setup | Yes | The architecture of the TD3 actor and critic consists of a network with two hidden layers of size 256; the first layer has a tanh activation and the second an elu activation. The actor outputs actions with a tanh activation, scaled by the action boundaries of each environment. Apart from the activation functions, we use the default parameters of TD3 from the authors' implementation, and run 10^6 gradient steps using the Adam optimizer with batches of size 256 (an architecture sketch follows the table).
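
The Pseudocode row points to Algorithm 1 (CVAE training) and Algorithm 2 (modified TD3). The sketch below illustrates only the core idea the paper states, subtracting a prediction-based bonus from the reward; it assumes PyTorch, an illustrative `cvae(state, action)` module returning a reconstructed action, and an `alpha` coefficient. None of this is the authors' released code (no code is linked), and the exact placement of the penalty inside TD3 should be taken from Algorithm 2.

```python
# Hedged sketch of the anti-exploration penalty (assumption: PyTorch;
# `cvae` and `alpha` are illustrative, not the authors' code).
import torch


def anti_exploration_bonus(cvae, state, action):
    """Novelty bonus: per-sample action reconstruction error of the CVAE."""
    recon_action, _mu, _logvar = cvae(state, action)      # assumed CVAE interface
    return ((recon_action - action) ** 2).mean(dim=-1)    # large for OOD actions


def penalized_reward(cvae, state, action, reward, alpha=1.0):
    """Subtract the bonus from the dataset reward instead of adding it."""
    with torch.no_grad():
        bonus = anti_exploration_bonus(cvae, state, action)
    return reward - alpha * bonus
```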
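The D4RL datasets named in the Open Datasets row are publicly available through the d4rl package. The snippet below is a loading sketch, not the authors' data pipeline; the environment id is one example among the locomotion and hand-manipulation tasks.

```python
# Hedged sketch: loading a D4RL dataset with the public d4rl package.
import gym
import d4rl  # registers the D4RL environments with gym

env = gym.make("hopper-medium-v0")   # example locomotion task
data = d4rl.qlearning_dataset(env)   # observations, actions, rewards,
                                     # next_observations, terminals
print({k: v.shape for k, v in data.items()})
```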
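The Experiment Setup row fully specifies the actor and critic MLPs. Below is a sketch under the assumption of PyTorch (the paper does not name a framework); class names are hypothetical, and applying the same tanh/elu pattern to the critic is an assumption since the row describes both networks jointly.

```python
# Hedged sketch of the described actor/critic MLPs (assumption: PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Two hidden layers of 256: tanh then elu; tanh output scaled to action bounds."""

    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.l1 = nn.Linear(state_dim, 256)
        self.l2 = nn.Linear(256, 256)
        self.l3 = nn.Linear(256, action_dim)
        self.max_action = max_action

    def forward(self, state):
        h = torch.tanh(self.l1(state))
        h = F.elu(self.l2(h))
        return self.max_action * torch.tanh(self.l3(h))


class Critic(nn.Module):
    """Same hidden structure; TD3 keeps two copies of this Q-network."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.l1 = nn.Linear(state_dim + action_dim, 256)
        self.l2 = nn.Linear(256, 256)
        self.l3 = nn.Linear(256, 1)

    def forward(self, state, action):
        h = torch.tanh(self.l1(torch.cat([state, action], dim=-1)))
        h = F.elu(self.l2(h))
        return self.l3(h)
```

Per the row above, training would then run 10^6 gradient steps of the Adam optimizer with batches of size 256, keeping the remaining TD3 defaults.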