Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality

Authors: Tom Zahavy, Yannick Schroecker, Feryal Behbahani, Kate Baumli, Sebastian Flennerhag, Shaobo Hou, Satinder Singh

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments are designed to validate and get confidence in the DOMiNO agent. We emphasize that we do not explicitly compare DOMiNO with previous work nor argue that one works better than the other. Instead, we address the following questions: (a) Can DOMiNO discover diverse policies that are near optimal? (see Fig. 2, Appendix C.1, Fig. 1b and the videos in the supplementary). (b) Can DOMiNO balance the QD trade-off? (see Figs. 2 and 3). (c) Do the discovered policies enable robustness and fast adaptation to perturbations of the environment? (see Fig. 4).
Researcher Affiliation | Industry | Tom Zahavy, Yannick Schroecker, Feryal Behbahani, Kate Baumli, Sebastian Flennerhag, Shaobo Hou and Satinder Singh. DeepMind, London.
Pseudocode | Yes | Pseudo code and further implementation details, as well as treatment of the discounted state occupancy, can be found in Appendix B.
Open Source Code | No | The paper references existing open-source libraries used (e.g., rlax), but does not state that the authors' own implementation of DOMiNO is open-source or provide a link to its source code.
Open Datasets | Yes | We conducted most of our experiments on domains from the DM Control Suite (Tassa et al., 2018), standard continuous control locomotion tasks where diverse near-optimal policies should naturally correspond to different gaits.
Dataset Splits | No | The paper reports 95% confidence intervals and uses multiple seeds for experiments, but it does not specify train/validation/test dataset splits (e.g., percentages or counts) or a cross-validation setup for reproducibility.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using 'RLAX' and optimizers like 'RMSprop' and 'Adam', but it does not specify version numbers for these software components.
Experiment Setup | Yes | The hyperparameters in Table 2 are shared across all environments, except in the BiPedal domain, where the learning rate is set to 10^-5 and the learner frames are 5 × 10^7. We report the DOMiNO-specific hyperparameters in Table 3.
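
The Experiment Setup entry describes one shared hyperparameter set with a per-domain exception. A minimal sketch of how such an override could be expressed is given here, reading the extracted values as a learning rate of 10^-5 and 5 × 10^7 learner frames. The key names, the domain identifier "bipedal", and every value in `shared_placeholder` are assumptions for illustration only; the actual shared values are in Table 2 of the paper and are not reproduced in this summary.

```python
from typing import Any, Dict, Mapping

# Overrides stated for the BiPedal domain in the Experiment Setup entry.
# Key names and the "bipedal" identifier are hypothetical labels.
BIPEDAL_OVERRIDES: Dict[str, Any] = {
    "learning_rate": 1e-5,
    "learner_frames": 5 * 10**7,
}


def resolve_hparams(shared: Mapping[str, Any], domain: str) -> Dict[str, Any]:
    """Return a copy of the shared hyperparameters with BiPedal overrides applied."""
    hparams = dict(shared)  # copy so the caller's shared defaults are not mutated
    if domain.lower() == "bipedal":
        hparams.update(BIPEDAL_OVERRIDES)
    return hparams


# Placeholder values, NOT taken from the paper; substitute the Table 2 entries.
shared_placeholder: Dict[str, Any] = {
    "learning_rate": 1e-4,
    "learner_frames": 10**7,
    "batch_size": 64,
}

bipedal_hparams = resolve_hparams(shared_placeholder, "BiPedal")
other_hparams = resolve_hparams(shared_placeholder, "walker")
```

Keeping the exception in a separate mapping leaves the shared defaults untouched, so every other domain resolves to the same configuration.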