Simple Ingredients for Offline Reinforcement Learning
Authors: Edoardo Cetin, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric, Yann Ollivier, Ahmed Touati
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We conduct a large empirical study where we formulate and test several hypotheses to explain this failure." and "Through a systematic empirical analysis, we test these hypotheses and solutions across three representative algorithms (TD3+BC, AWAC, IQL) and various hyperparameters, conducting over 50,000 experiments." |
| Researcher Affiliation | Industry | Sakana AI, Tokyo, Japan (work done at Meta); FAIR at Meta, Paris, France. |
| Pseudocode | Yes | Algorithm 1 Online deployment with evaluation sampling and Algorithm 2 Advantage Sampled Actor Critic (ASAC) |
| Open Source Code | Yes | We provide access to our code at: https://github.com/facebookresearch/offline_rl. |
| Open Datasets | Yes | "Prior offline RL methods have been extensively tested and validated using well-known benchmarks such as D4RL (Fu et al., 2020) and RL-unplugged (Gulcehre et al., 2020)..." and "We build MOOD on top of the DeepMind Control suite (Tassa et al., 2018)..." |
| Dataset Splits | No | The paper mentions conducting experiments and selecting configurations but does not explicitly state the train/validation/test dataset splits for the main RL experiments in percentages or specific counts. While Appendix E.1 mentions 'reserved validation data 15%' for a density model, this is not for the primary RL experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions 'Numpy/Pytorch' but does not specify version numbers, nor does it list other key software dependencies with the specific versions required for replication. |
| Experiment Setup | Yes | "We use a shallow network architecture (2 hidden layers of 256 units with ReLUs in between) for both the actor and the critic of all algorithms, as it is common in existing implementations. For each experiment (i.e., pair of algorithm and dataset), we perform a grid search over the hyperparameters specific to each offline algorithm using 5 random seeds, and select the configurations that lead to the highest cumulative return after 1.5 × 10^6 optimization steps (5 × 10^6 for humanoid)." and "Table 9 summarizes the range of hyperparameter sweeps we used in our experiments for each algorithm and testbed." |
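
The Experiment Setup row only specifies the shared actor/critic architecture (2 hidden layers of 256 units with ReLUs). Below is a minimal PyTorch sketch of that architecture. The class names, the Tanh output on the actor, and the single Q-head are illustrative assumptions in the style of common TD3+BC implementations, not details taken from the paper or its repository.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Policy network: state -> action, squashed to [-1, 1] (Tanh head is an assumption)."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Shallow architecture from the quoted setup: 2 hidden layers of 256 units with ReLUs.
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class Critic(nn.Module):
    """Q-network: (state, action) -> scalar value, same 2x256 ReLU trunk."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```

Per the quoted setup, each algorithm-dataset pair would then be tuned with a grid search over algorithm-specific hyperparameters using 5 random seeds, keeping the configuration with the highest cumulative return after 1.5 × 10^6 optimization steps (5 × 10^6 for humanoid); see the released code at https://github.com/facebookresearch/offline_rl for the authors' actual implementation.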