Dynamic Update-to-Data Ratio: Minimizing World Model Overfitting
Authors: Nicolai Dorka, Tim Welschehold, Wolfram Burgard
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our method to Dreamer V2, a state-of-the-art model-based reinforcement learning algorithm, and evaluate it on the Deep Mind Control Suite and the Atari 100k benchmark. The results demonstrate that one can better balance under- and overfitting by adjusting the UTD ratio with our approach compared to the default setting in Dreamer V2 and that it is competitive with an extensive hyperparameter search which is not feasible for many applications. |
| Researcher Affiliation | Academia | Nicolai Dorka¹, Tim Welschehold¹, Wolfram Burgard² (¹University of Freiburg, ²University of Technology Nuremberg); dorka@cs.uni-freiburg.de |
| Pseudocode | Yes | Algorithm 1 DUTD (in terms of inverted UTD ratio) |
| Open Source Code | Yes | The source code of our implementation is publicly available at https://github.com/Nicolinho/dutd |
| Open Datasets | Yes | We evaluate DUTD applied to Dreamer V2 Hafner et al. (2021) on the Atari 100k benchmark Kaiser et al. (2019) and the Deep Mind Control Suite Tassa et al. (2018). |
| Dataset Splits | No | Not specified explicitly in traditional train/val/test percentages or fixed counts from a static dataset, as the dataset is continually evolving. The paper describes a dynamic collection strategy for validation data: 'Every 100,000 steps DUTD collects 3,000 transitions of additional validation data.' |
| Hardware Specification | No | The paper mentions 'our hardware' and the duration of runs on it (e.g., 'a single run would take roughly two weeks to run on our hardware') but does not specify any exact hardware components like CPU/GPU models or memory. |
| Software Dependencies | No | The paper uses and links to the Dreamer V2 codebase but does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For each of the two benchmarks we use the respective hyperparameters provided by the authors in their original code base. Accordingly, the baseline IUTD ratio is set to a value of 5 for the control suite and 16 for Atari which we also use as initial value for our method. This means an update step is performed every 5 and 16 environment steps respectively. For both benchmarks we set the increment value of DUTD to c = 1.3 and the IUTD ratio is updated every 500 steps which corresponds to the length of one episode in the control suite (with a frameskip of 2). Every 100,000 steps DUTD collects 3,000 transitions of additional validation data. We cap the IUTD ratio in the interval [1, 15] for the control suite and in [1, 32] for Atari. |
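
The Pseudocode row cites Algorithm 1, which formulates DUTD in terms of the inverted UTD (IUTD) ratio, i.e. the number of environment steps per update step. The snippet below is a minimal sketch of that adjustment rule under stated assumptions, not the authors' implementation: it assumes the IUTD ratio is multiplied by the increment c when the world model's validation loss worsens (an overfitting signal) and divided by c when it improves, then clipped to the benchmark-specific interval. The function name `adjust_iutd` and its arguments are ours.

```python
def adjust_iutd(iutd, val_loss, prev_val_loss, c=1.3, iutd_min=1.0, iutd_max=15.0):
    """Hedged sketch of a DUTD-style IUTD adjustment (not the released code).

    Assumption: a rising world-model validation loss signals overfitting, so the
    agent should update less often (raise IUTD); a falling loss leaves room for
    more training (lower IUTD). The ratio is kept inside [iutd_min, iutd_max].
    """
    if prev_val_loss is not None and val_loss > prev_val_loss:
        iutd *= c   # overfitting signal: fewer world-model updates per env step
    else:
        iutd /= c   # otherwise: more world-model updates per env step
    return min(max(iutd, iutd_min), iutd_max)
```

In a training loop this adjustment would run every 500 environment steps (the adjustment interval quoted in the setup row), and a world-model update would be triggered whenever the number of environment steps since the last update reaches the current IUTD value.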
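The Experiment Setup row fixes all DUTD hyperparameters. The configuration sketch below merely collects those quoted values in one place; the class and field names are hypothetical and are not taken from the released repository.

```python
from dataclasses import dataclass

@dataclass
class DUTDConfig:
    """Hyperparameters quoted in the experiment-setup row; field names are ours."""
    initial_iutd: float               # baseline IUTD ratio, also DUTD's starting value
    iutd_bounds: tuple[float, float]  # interval the IUTD ratio is capped to
    increment: float = 1.3            # multiplicative increment c
    adjust_every: int = 500           # env steps between IUTD adjustments (one DMC episode at frameskip 2)
    val_collect_every: int = 100_000  # env steps between validation-data collections
    val_transitions: int = 3_000      # transitions added to the validation set each time

# DeepMind Control Suite: IUTD starts at 5, capped to [1, 15].
DMC_CONFIG = DUTDConfig(initial_iutd=5.0, iutd_bounds=(1.0, 15.0))

# Atari 100k: IUTD starts at 16, capped to [1, 32].
ATARI_CONFIG = DUTDConfig(initial_iutd=16.0, iutd_bounds=(1.0, 32.0))
```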