Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning

Authors: Abdullah Akgül, Manuel Haussmann, Melih Kandemir

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a theoretical result demonstrating the strong dependency of suboptimality on the number of Monte Carlo samples taken per Bellman target calculation. Our main contribution is a deterministic approximation to the Bellman target that uses progressive moment matching, a method developed originally for deterministic variational inference. ... We also observe MOMBO to converge faster than these approaches in a large set of benchmark tasks. ... We present comprehensive experiment results showing that MOMBO significantly accelerates training convergence while maintaining asymptotic performance.
Researcher Affiliation | Academia | Abdullah Akgül, Manuel Haußmann, Melih Kandemir; Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark; {akgul,haussmann,kandemir}@imada.sdu.dk
Pseudocode | Yes | Algorithm 1: Deterministic uncertainty propagation through moment matching (a sketch of this propagation step follows the table)
Open Source Code | Yes | The source code of our algorithm is available at https://github.com/adinlab/MOMBO.
Open Datasets | Yes | We compare MOMBO against MOPO and MOBILE, the two representative PEVI variants, across twelve tasks from the D4RL dataset (Fu et al., 2020)...
Dataset Splits | Yes | After evaluating the performance of each model on a validation set, we select the N_elite best-performing ensemble elements for further processing. ... We apply early stopping with 5 steps using the validation dataset. (An illustrative elite-selection and early-stopping sketch follows the table.)
Hardware Specification | Yes | We perform our experiments on three computational resources: 1) Tesla V100 GPU, Intel(R) Xeon(R) Gold 6230 CPU at 2.10 GHz, and 46 GB of memory; 2) NVIDIA Tesla A100 GPU, AMD EPYC 7F72 CPU at 3.2 GHz, and 256 GB of memory; and 3) GeForce RTX 4090 GPU, Intel(R) Core(TM) i7-14700K CPU at 5.6 GHz, and 96 GB of memory.
Software Dependencies | No | The paper mentions the Adam optimizer (Kingma and Ba, 2015) and PyTorch (implicitly through OfflineRL-Kit in Appendix C.2), but does not provide version numbers for these or for other key dependencies such as Python, CUDA, or specific libraries.
Experiment Setup | Yes | We train MOMBO for 3000 episodes on the D4RL dataset and 2000 episodes on the mixed dataset, performing updates to the policy and Q-function 1000 times per episode with a batch size of 256. We set the learning rate for the critics to 0.0003, while the learning rate for the actor is 0.0001. ... We set the discount factor to γ = 0.99 and the soft update parameter to τ = 0.005. The α parameter is learned during training... We set the batch size for rollouts to 50000... (These values are collected in the configuration sketch below.)
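
The Research Type and Pseudocode rows refer to Algorithm 1, which replaces Monte Carlo sampling of the Bellman target with deterministic moment matching. The following is a minimal sketch of that propagation step for a fully connected ReLU network, assuming a diagonal-Gaussian input belief (for instance, the dynamics model's Gaussian next-state prediction); the function and variable names are illustrative and are not taken from the authors' repository.

import torch
import torch.nn as nn
from torch.distributions import Normal

def relu_moments(mean, var, eps=1e-6):
    # Closed-form mean and variance of ReLU(x) for x ~ N(mean, var), elementwise.
    std = var.clamp_min(eps).sqrt()
    alpha = mean / std
    cdf = Normal(0.0, 1.0).cdf(alpha)
    pdf = torch.exp(Normal(0.0, 1.0).log_prob(alpha))
    new_mean = mean * cdf + std * pdf
    second_moment = (mean ** 2 + var) * cdf + mean * std * pdf
    new_var = (second_moment - new_mean ** 2).clamp_min(0.0)
    return new_mean, new_var

def propagate(layers, mean, var):
    # Push a diagonal-Gaussian belief through alternating Linear/ReLU layers.
    for layer in layers:
        if isinstance(layer, nn.Linear):
            # Linear map of a Gaussian: exact mean; diagonal variance under independence.
            mean = layer(mean)
            var = var @ (layer.weight ** 2).t()
        elif isinstance(layer, nn.ReLU):
            mean, var = relu_moments(mean, var)
    return mean, var

# Example: a two-layer critic; the belief over the next state is N(mean, var).
net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
q_mean, q_var = propagate(list(net), torch.zeros(1, 4), torch.ones(1, 4))

Propagating the first two moments in this way yields a closed-form estimate of the Q-target's mean and variance, so no Monte Carlo samples are needed per Bellman backup.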
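
The Dataset Splits row describes keeping the N_elite best ensemble members according to a validation set and applying early stopping with 5 steps. A minimal sketch of how such a procedure could look; the helper names and the patience default are assumptions, not details from the paper.

def select_elites(models, val_losses, n_elite):
    # Rank dynamics models by validation loss and keep the n_elite best.
    order = sorted(range(len(models)), key=lambda i: val_losses[i])
    return [models[i] for i in order[:n_elite]]

class EarlyStopper:
    # Signal a stop once the validation loss has not improved for `patience` checks.
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_steps = val_loss, 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience  # True -> stop training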
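
The hyperparameters quoted in the Experiment Setup row, gathered into one configuration for reference. The dictionary keys are illustrative; only the values are quoted from the paper, and reading α as a SAC-style entropy temperature is an assumption.

config = dict(
    episodes_d4rl=3000,        # training episodes on the D4RL dataset
    episodes_mixed=2000,       # training episodes on the mixed dataset
    updates_per_episode=1000,  # policy / Q-function updates per episode
    batch_size=256,
    critic_lr=3e-4,
    actor_lr=1e-4,
    gamma=0.99,                # discount factor
    tau=0.005,                 # soft target-update parameter
    rollout_batch_size=50000,  # batch size for model rollouts
    # alpha is learned during training (assumed to be the entropy temperature)
)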