Bigger, Regularized, Optimistic: scaling for compute and sample efficient continuous control
Authors: Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, Marek Cygan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a thorough investigation into the interplay of scaling model capacity and domain-specific RL enhancements. These empirical findings inform the design choices underlying our proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key insight behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. (A minimal sketch of such a regularized critic follows the table.) |
| Researcher Affiliation | Collaboration | Ideas NCBR; University of Warsaw; Warsaw University of Technology; Polish Academy of Sciences; Nomagic. |
| Pseudocode | Yes | We describe the process of hyperparameter selection for all considered algorithms in Appendix D and share BRO pseudocode in the Appendix (Pseudocode 1). (A stripped-down outline of such a training loop follows the table.) |
| Open Source Code | Yes | We implement BRO based on JaxRL (Kostrikov, 2021) and make the code available under the following link: https://github.com/naumix/BiggerRegularizedOptimistic |
| Open Datasets | Yes | Environments We consider a wide range of control tasks, encompassing a total of 40 diverse, complex continuous control tasks spanning three simulation domains: DeepMind Control (Tassa et al., 2018), Meta-World (Yu et al., 2020), and MyoSuite (Caggiano et al., 2022) (a detailed list of environments can be found in Appendix C). These tasks include high-dimensional state and action spaces (with \|S\| and \|A\| reaching 223 and 39 dimensions), sparse rewards, complex locomotion tasks, and physiologically accurate musculoskeletal motor control. (An environment-loading example follows the table.) |
| Dataset Splits | No | The paper uses established RL benchmarks for training and evaluation. It does not define explicit training/validation/test dataset splits with percentages, sample counts, or specific files for data generated during training in the conventional supervised learning sense. Instead, it details running experiments with multiple seeds and evaluating performance at certain environment steps. |
| Hardware Specification | Yes | Experiments were conducted on an NVIDIA A100 GPU with 10GB of RAM and 8 CPU cores of AMD EPYC 7742 processor. |
| Software Dependencies | No | The paper mentions 'Jax RL (Kostrikov, 2021)' and the 'RLiable package (Agarwal et al., 2021)' as tools used, along with optimizers like 'ADAMW' and 'ADAM' in Table 5. However, it does not provide specific version numbers for these software components, which is required for reproducible dependency descriptions. |
| Experiment Setup | Yes | Hyperparameters of BRO and the other baselines are listed in Table 5 ("Hyperparameter values for actor-critic agents used in the experiments"), which covers batch size, replay ratio, critic hidden depth and size, actor depth and size, number of quantiles, KL target, initial optimism, std multiplier, actor/critic/temperature learning rates, optimizer, discount, initial temperature, exploratory steps, target entropy, and Polyak weight. (An illustrative optimizer setup follows the table.) |
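
The quoted design claim, that strong regularization is what makes critic scaling effective, pairs naturally with a concrete picture. Below is a minimal Flax sketch of a LayerNorm-regularized residual critic in the spirit of the paper's description. The class names, hidden width, block depth, and quantile count are placeholders rather than the paper's reported values (those live in its Table 5), so treat this as an illustration of the pattern, not the paper's architecture.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class ResidualBlock(nn.Module):  # illustrative name, not from the paper
    hidden: int

    @nn.compact
    def __call__(self, x):
        # Dense -> LayerNorm -> ReLU -> Dense -> LayerNorm, with a skip connection.
        h = nn.Dense(self.hidden)(x)
        h = nn.LayerNorm()(h)
        h = nn.relu(h)
        h = nn.Dense(self.hidden)(h)
        h = nn.LayerNorm()(h)
        return x + h


class Critic(nn.Module):
    hidden: int = 512         # placeholder width; see the paper's Table 5
    blocks: int = 2           # placeholder depth
    num_quantiles: int = 100  # placeholder; the critic output is distributional

    @nn.compact
    def __call__(self, obs, action):
        x = jnp.concatenate([obs, action], axis=-1)
        # Input projection, also normalized before the residual stack.
        x = nn.relu(nn.LayerNorm()(nn.Dense(self.hidden)(x)))
        for _ in range(self.blocks):
            x = ResidualBlock(self.hidden)(x)
        return nn.Dense(self.num_quantiles)(x)


# Shape check with dummy inputs matching the largest state/action sizes quoted above.
params = Critic().init(jax.random.PRNGKey(0), jnp.zeros((1, 223)), jnp.zeros((1, 39)))
```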
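Since the actual Pseudocode 1 sits in the paper's appendix, here is only a stripped-down, self-contained outline of the loop structure it implies: a high replay ratio means several gradient updates per environment step. The environment, agent update, and batch size below are stubs, labeled as such; nothing here reproduces the BRO update itself.

```python
import collections
import random

REPLAY_RATIO = 10  # placeholder; Table 5 lists the value actually used
buffer = collections.deque(maxlen=1_000_000)  # minimal replay-buffer stand-in


def env_step(action):
    # Stub for a real environment transition: (next_obs, reward, done).
    return [0.0], 0.0, False


def update(batch):
    # Stub for one gradient step on actor and critic.
    pass


obs = [0.0]
for step in range(1000):
    action = random.uniform(-1.0, 1.0)  # the real agent acts optimistically here
    next_obs, reward, done = env_step(action)
    buffer.append((obs, action, reward, next_obs, done))
    for _ in range(REPLAY_RATIO):  # replay ratio > 1: many updates per env step
        if len(buffer) >= 32:
            update(random.sample(list(buffer), 32))
    obs = next_obs
```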
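As a small illustration of the benchmark side, this is how a DeepMind Control task is loaded through the standard `dm_control` API. The task choice is arbitrary and not drawn from the paper's Appendix C list.

```python
from dm_control import suite

# Load one control task; "humanoid run" is an illustrative choice.
env = suite.load(domain_name="humanoid", task_name="run")
timestep = env.reset()
print(env.observation_spec())      # dict of observation arrays
print(env.action_spec().shape)     # (21,) here; up to 39 dims across the suites used
```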
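Finally, since Table 5 names ADAMW and ADAM among the optimizers, an `optax` setup of the kind a JAX codebase would use might look as follows. The learning-rate value and the actor/critic/temperature pairing are assumptions for illustration, not the paper's reported configuration.

```python
import optax

# Placeholder learning rate; the paper's Table 5 lists the actual values.
LR = 3e-4

critic_optimizer = optax.adamw(learning_rate=LR)        # "ADAMW" in Table 5
actor_optimizer = optax.adam(learning_rate=LR)          # "ADAM" in Table 5
temperature_optimizer = optax.adam(learning_rate=LR)    # pairing is an assumption
```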