Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning
Authors: Abdullah Akgül, Manuel Haussmann, Melih Kandemir
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a theoretical result demonstrating the strong dependency of suboptimality on the number of Monte Carlo samples taken per Bellman target calculation. Our main contribution is a deterministic approximation to the Bellman target that uses progressive moment matching, a method developed originally for deterministic variational inference. ... We also observe MOMBO to converge faster than these approaches in a large set of benchmark tasks. ... We present comprehensive experiment results showing that MOMBO significantly accelerates training convergence while maintaining asymptotic performance. |
| Researcher Affiliation | Academia | Abdullah Akgül Manuel Haußmann Melih Kandemir Department of Mathematics and Computer Science University of Southern Denmark Odense, Denmark {akgul,haussmann,kandemir}@imada.sdu.dk |
| Pseudocode | Yes | Algorithm 1 Deterministic uncertainty propagation through moment matching |
| Open Source Code | Yes | The source code of our algorithm is available at https://github.com/adinlab/MOMBO. |
| Open Datasets | Yes | We compare MOMBO against MOPO and MOBILE, the two representative PEVI variants, across twelve tasks from the D4RL dataset (Fu et al., 2020)... |
| Dataset Splits | Yes | After evaluating the performance of each model on a validation set, we select the N_elite best-performing ensemble elements for further processing. ... We apply early stopping with 5 steps using the validation dataset. |
| Hardware Specification | Yes | We perform our experiments on three computational resources: 1) Tesla V100 GPU, Intel(R) Xeon(R) Gold 6230 CPU at 2.10 GHz, and 46 GB of memory; 2) NVIDIA Tesla A100 GPU, AMD EPYC 7F72 CPU at 3.2 GHz, and 256 GB of memory; and 3) GeForce RTX 4090 GPU, Intel(R) Core(TM) i7-14700K CPU at 5.6 GHz, and 96 GB of memory. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer (Kingma and Ba, 2015)' and 'pytorch' (implicitly through 'Offline RL-Kit' in Appendix C.2), but does not provide specific version numbers for these or other key software dependencies like Python, CUDA, or specific library versions. |
| Experiment Setup | Yes | We train MOMBO for 3000 episodes on the D4RL dataset and 2000 episodes on the mixed dataset, performing updates to the policy and Q-function 1000 times per episode with a batch size of 256. We set the learning rate for the critics to 0.0003, while the learning rate for the actor is 0.0001. ... We set the discount factor to γ = 0.99 and the soft update parameter to τ = 0.005. The α parameter is learned during training... We set the batch size for rollouts to 50000... |
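
The paper's core technique (Algorithm 1, quoted in the Pseudocode row above) propagates a Gaussian belief through the critic network by moment matching rather than by Monte Carlo sampling of Bellman targets. The sketch below shows the standard closed-form moment-matching rules for a linear layer followed by a ReLU, as used in deterministic variational inference; the function names and the PyTorch framing are illustrative assumptions, not code taken from the MOMBO repository.

```python
import math
import torch

def linear_moments(mean, var, weight, bias):
    """Propagate a diagonal Gaussian through a deterministic linear layer.

    For y = W x + b with x having mean `mean` and diagonal variance `var`:
    E[y] = W mean + b,  Var[y] = (W ** 2) var  (element-wise square of W).
    `weight` has shape (out_features, in_features), as in torch.nn.Linear.
    """
    out_mean = mean @ weight.t() + bias
    out_var = var @ (weight ** 2).t()
    return out_mean, out_var

def relu_moments(mean, var, eps=1e-8):
    """Propagate a diagonal Gaussian N(mean, var) through a ReLU by moment matching.

    Uses the closed-form expressions for E[ReLU(x)] and Var[ReLU(x)] of a
    Gaussian input, with Φ and φ the standard normal CDF and PDF.
    """
    std = torch.sqrt(var.clamp_min(eps))
    alpha = mean / std
    cdf = 0.5 * (1.0 + torch.erf(alpha / math.sqrt(2.0)))           # Φ(α)
    pdf = torch.exp(-0.5 * alpha ** 2) / math.sqrt(2.0 * math.pi)   # φ(α)
    out_mean = mean * cdf + std * pdf
    out_sq = (mean ** 2 + var) * cdf + mean * std * pdf             # E[ReLU(x)^2]
    out_var = (out_sq - out_mean ** 2).clamp_min(0.0)
    return out_mean, out_var
```

Stacking `linear_moments` and `relu_moments` layer by layer yields a deterministic mean and variance at the network output, which is what replaces sampled Bellman targets in the paper's description of MOMBO.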
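
For reference, the training hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The dictionary keys below are hypothetical names chosen for readability, not identifiers from the released code.

```python
# Hypothetical grouping of the hyperparameters quoted from the paper.
mombo_config = {
    "episodes_d4rl": 3000,        # training episodes on the D4RL tasks
    "episodes_mixed": 2000,       # training episodes on the mixed dataset
    "updates_per_episode": 1000,  # policy/Q-function updates per episode
    "batch_size": 256,
    "critic_lr": 3e-4,
    "actor_lr": 1e-4,
    "gamma": 0.99,                # discount factor γ
    "tau": 0.005,                 # soft (Polyak) update parameter τ
    "rollout_batch_size": 50000,
    "alpha": "learned",           # entropy temperature α is learned during training
}
```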