Simple Ingredients for Offline Reinforcement Learning
Authors: Edoardo Cetin, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric, Yann Ollivier, Ahmed Touati
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We conduct a large empirical study where we formulate and test several hypotheses to explain this failure." and "Through a systematic empirical analysis, we test these hypotheses and solutions across three representative algorithms (TD3+BC, AWAC, IQL) and various hyperparameters, conducting over 50,000 experiments." |
| Researcher Affiliation | Industry | Sakana AI, Tokyo, Japan (work done at Meta); FAIR at Meta, Paris, France. |
| Pseudocode | Yes | Algorithm 1 Online deployment with evaluation sampling and Algorithm 2 Advantage Sampled Actor Critic (ASAC) |
| Open Source Code | Yes | We provide access to our code at: https://github.com/facebookresearch/offline_rl. |
| Open Datasets | Yes | "Prior offline RL methods have been extensively tested and validated using well-known benchmarks such as D4RL (Fu et al., 2020) and RL-unplugged (Gulcehre et al., 2020)..." and "We build MOOD on top of the DeepMind Control suite (Tassa et al., 2018)..." |
| Dataset Splits | No | The paper mentions conducting experiments and selecting configurations but does not explicitly state the train/validation/test dataset splits for the main RL experiments in percentages or specific counts. While Appendix E.1 mentions 'reserved validation data 15%' for a density model, this is not for the primary RL experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions 'Numpy/Pytorch' but does not specify version numbers, nor does it list other key software dependencies with the specific versions required for replication. |
| Experiment Setup | Yes | "We use a shallow network architecture (2 hidden layers of 256 units with ReLUs in between) for both the actor and the critic of all algorithms, as it is common in existing implementations. For each experiment (i.e., pair of algorithm and dataset), we perform a grid search over the hyperparameters specific to each offline algorithm using 5 random seeds, and select the configurations that lead to the highest cumulative return after 1.5 × 10^6 optimization steps (5 × 10^6 for humanoid)." and "Table 9 summarizes the range of hyperparameter sweeps we used in our experiments for each algorithm and testbed." |
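
The Experiment Setup row only specifies the shared actor/critic architecture (2 hidden layers of 256 units with ReLUs). Below is a minimal PyTorch sketch of that architecture. The class names, the Tanh output on the actor, and the single Q-head are illustrative assumptions in the style of common TD3+BC implementations, not details taken from the paper or its repository.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Policy network: state -> action, squashed to [-1, 1] (Tanh head is an assumption)."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Shallow architecture from the quoted setup: 2 hidden layers of 256 units with ReLUs.
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class Critic(nn.Module):
    """Q-network: (state, action) -> scalar value, same 2x256 ReLU trunk."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```

Per the quoted setup, each algorithm-dataset pair would then be tuned with a grid search over algorithm-specific hyperparameters using 5 random seeds, keeping the configuration with the highest cumulative return after 1.5 × 10^6 optimization steps (5 × 10^6 for humanoid); see the released code at https://github.com/facebookresearch/offline_rl for the authors' actual implementation.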