Bigger, Regularized, Optimistic: scaling for compute and sample efficient continuous control
Authors: Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, Marek Cygan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a thorough investigation into the interplay of scaling model capacity and domain-specific RL enhancements. These empirical findings inform the design choices underlying our proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key insight behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. (A minimal sketch of such a regularized critic follows the table.) |
| Researcher Affiliation | Collaboration | Ideas NCBR; University of Warsaw; Warsaw University of Technology; Polish Academy of Sciences; Nomagic. |
| Pseudocode | Yes | We describe the process of hyperparameter selection for all considered algorithms in Appendix D and share BRO pseudocode in the Appendix (Pseudocode 1). (A stripped-down outline of such a training loop follows the table.) |
| Open Source Code | Yes | We implement BRO based on JaxRL (Kostrikov, 2021) and make the code available under the following link: https://github.com/naumix/BiggerRegularizedOptimistic |
| Open Datasets | Yes | Environments We consider a wide range of control tasks, encompassing a total of 40 diverse, complex continuous control tasks spanning three simulation domains: DeepMind Control (Tassa et al., 2018), Meta-World (Yu et al., 2020), and MyoSuite (Caggiano et al., 2022) (a detailed list of environments can be found in Appendix C). These tasks include high-dimensional state and action spaces (with \|S\| and \|A\| reaching 223 and 39 dimensions), sparse rewards, complex locomotion tasks, and physiologically accurate musculoskeletal motor control. (An environment-loading example follows the table.) |
| Dataset Splits | No | The paper uses established RL benchmarks for training and evaluation. It does not define explicit training/validation/test dataset splits with percentages, sample counts, or specific files for data generated during training in the conventional supervised learning sense. Instead, it details running experiments with multiple seeds and evaluating performance at certain environment steps. |
| Hardware Specification | Yes | Experiments were conducted on an NVIDIA A100 GPU with 10GB of RAM and 8 CPU cores of AMD EPYC 7742 processor. |
| Software Dependencies | No | The paper mentions 'Jax RL (Kostrikov, 2021)' and the 'RLiable package (Agarwal et al., 2021)' as tools used, along with optimizers like 'ADAMW' and 'ADAM' in Table 5. However, it does not provide specific version numbers for these software components, which is required for reproducible dependency descriptions. |
| Experiment Setup | Yes | Hyperparameters of BRO and the other baselines are listed in Table 5 ("Hyperparameter values for actor-critic agents used in the experiments"), which covers batch size, replay ratio, critic hidden depth and size, actor depth and size, number of quantiles, KL target, initial optimism, std multiplier, actor/critic/temperature learning rates, optimizer, discount, initial temperature, exploratory steps, target entropy, and Polyak weight. (An illustrative optimizer setup follows the table.) |
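
The quoted design claim, that strong regularization is what makes critic scaling effective, pairs naturally with a concrete picture. Below is a minimal Flax sketch of a LayerNorm-regularized residual critic in the spirit of the paper's description. The class names, hidden width, block depth, and quantile count are placeholders rather than the paper's reported values (those live in its Table 5), so treat this as an illustration of the pattern, not the paper's architecture.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class ResidualBlock(nn.Module):  # illustrative name, not from the paper
    hidden: int

    @nn.compact
    def __call__(self, x):
        # Dense -> LayerNorm -> ReLU -> Dense -> LayerNorm, with a skip connection.
        h = nn.Dense(self.hidden)(x)
        h = nn.LayerNorm()(h)
        h = nn.relu(h)
        h = nn.Dense(self.hidden)(h)
        h = nn.LayerNorm()(h)
        return x + h


class Critic(nn.Module):
    hidden: int = 512         # placeholder width; see the paper's Table 5
    blocks: int = 2           # placeholder depth
    num_quantiles: int = 100  # placeholder; the critic output is distributional

    @nn.compact
    def __call__(self, obs, action):
        x = jnp.concatenate([obs, action], axis=-1)
        # Input projection, also normalized before the residual stack.
        x = nn.relu(nn.LayerNorm()(nn.Dense(self.hidden)(x)))
        for _ in range(self.blocks):
            x = ResidualBlock(self.hidden)(x)
        return nn.Dense(self.num_quantiles)(x)


# Shape check with dummy inputs matching the largest state/action sizes quoted above.
params = Critic().init(jax.random.PRNGKey(0), jnp.zeros((1, 223)), jnp.zeros((1, 39)))
```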
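Since the actual Pseudocode 1 sits in the paper's appendix, here is only a stripped-down, self-contained outline of the loop structure it implies: a high replay ratio means several gradient updates per environment step. The environment, agent update, and batch size below are stubs, labeled as such; nothing here reproduces the BRO update itself.

```python
import collections
import random

REPLAY_RATIO = 10  # placeholder; Table 5 lists the value actually used
buffer = collections.deque(maxlen=1_000_000)  # minimal replay-buffer stand-in


def env_step(action):
    # Stub for a real environment transition: (next_obs, reward, done).
    return [0.0], 0.0, False


def update(batch):
    # Stub for one gradient step on actor and critic.
    pass


obs = [0.0]
for step in range(1000):
    action = random.uniform(-1.0, 1.0)  # the real agent acts optimistically here
    next_obs, reward, done = env_step(action)
    buffer.append((obs, action, reward, next_obs, done))
    for _ in range(REPLAY_RATIO):  # replay ratio > 1: many updates per env step
        if len(buffer) >= 32:
            update(random.sample(list(buffer), 32))
    obs = next_obs
```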
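As a small illustration of the benchmark side, this is how a DeepMind Control task is loaded through the standard `dm_control` API. The task choice is arbitrary and not drawn from the paper's Appendix C list.

```python
from dm_control import suite

# Load one control task; "humanoid run" is an illustrative choice.
env = suite.load(domain_name="humanoid", task_name="run")
timestep = env.reset()
print(env.observation_spec())      # dict of observation arrays
print(env.action_spec().shape)     # (21,) here; up to 39 dims across the suites used
```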
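Finally, since Table 5 names ADAMW and ADAM among the optimizers, an `optax` setup of the kind a JAX codebase would use might look as follows. The learning-rate value and the actor/critic/temperature pairing are assumptions for illustration, not the paper's reported configuration.

```python
import optax

# Placeholder learning rate; the paper's Table 5 lists the actual values.
LR = 3e-4

critic_optimizer = optax.adamw(learning_rate=LR)        # "ADAMW" in Table 5
actor_optimizer = optax.adam(learning_rate=LR)          # "ADAM" in Table 5
temperature_optimizer = optax.adam(learning_rate=LR)    # pairing is an assumption
```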