Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality
Authors: Tom Zahavy, Yannick Schroecker, Feryal Behbahani, Kate Baumli, Sebastian Flennerhag, Shaobo Hou, Satinder Singh
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are designed to validate and get confidence in the DOMiNO agent. We emphasize that we do not explicitly compare DOMiNO with previous work nor argue that one works better than the other. Instead, we address the following questions: (a) Can DOMiNO discover diverse policies that are near optimal? see Fig. 2, Appendix C.1, Fig. 1b and the videos in the supplementary. (b) Can DOMiNO balance the QD trade-off? see Fig. 2, Fig. 2 & 3. (c) Do the discovered policies enable robustness and fast adaptation to perturbations of the environment? (see Fig. 4). |
| Researcher Affiliation | Industry | Tom Zahavy, Yannick Schroecker, Feryal Behbahani, Kate Baumli, Sebastian Flennerhag, Shaobo Hou and Satinder Singh DeepMind, London |
| Pseudocode | Yes | Pseudo code and further implementation details, as well as treatment of the discounted state occupancy, can be found in Appendix B. |
| Open Source Code | No | The paper references existing open-source libraries used (e.g., rlax), but does not state that the authors' own implementation of DOMiNO is open-source or provide a link to its source code. |
| Open Datasets | Yes | We conducted most of our experiments on domains from the DM Control Suite (Tassa et al., 2018), standard continuous control locomotion tasks where diverse near-optimal policies should naturally correspond to different gaits. |
| Dataset Splits | No | The paper reports 95% confidence intervals and uses multiple seeds for experiments, but it does not specify train/validation/test dataset splits (e.g., percentages or counts) or a cross-validation setup for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'RLAX' and optimizers like 'RMSprop' and 'Adam', but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | The hyperparameters in Table 2 are shared across all environments except in the BiPedal Domain the learning rate is set to 10^-5 and the learner frames are 5 × 10^7. We report the DOMiNO specific hyperparameters in Table 3. |