Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Value-Distributional Model-Based Reinforcement Learning
Authors: Carlos E. Luis, Alessandro G. Bottero, Julia Vinogradska, Felix Berkenkamp, Jan Peters
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation across several continuous-control tasks shows performance benefits with respect to both model-based and model-free algorithms. The code is available at https://github.com/boschresearch/dist-mbrl. Keywords: Model-Based Reinforcement Learning, Bayesian Reinforcement Learning, Distributional Reinforcement Learning, Uncertainty Quantification, Quantile Regression. [...] 6. Experiments: In this section, we evaluate EQR-SAC in environments with continuous state-action spaces. |
| Researcher Affiliation | Collaboration | Carlos E. Luis EMAIL Bosch Corporate Research, TU Darmstadt [...] Alessandro G. Bottero EMAIL Bosch Corporate Research, TU Darmstadt [...] Jan Peters EMAIL TU Darmstadt, German Research Center for AI (DFKI), Hessian.AI |
| Pseudocode | Yes | Algorithm 1 Epistemic Quantile-Regression (EQR) [...] Algorithm 2 Epistemic Quantile-Regression with Soft Actor-Critic (EQR-SAC) |
| Open Source Code | Yes | The code is available at https://github.com/boschresearch/dist-mbrl. |
| Open Datasets | Yes | We plot the performance of EQR-SAC and all baselines in Figure 7. EQR-SAC and QR-MBPO have the best overall performance, both using the optimistic objective function f_ofu (as previously defined after (20)). These results highlight the need to model uncertainty and leverage it during optimization; optimizing the mean values significantly degraded performance of the distributional approaches. In Figure 8, we inspect further the distribution of values learned by EQR-SAC during a training run. The value distribution is initially wide and heavy-tailed, as the agent rarely visits goal states. At 5K steps, the policy is close-to-optimal but the predicted distribution underestimates the true values. In subsequent steps, the algorithm explores other policies while reducing uncertainty and calibrating the predicted value distribution. At 12K steps, the algorithm stabilizes again at the optimized policy, but with a calibrated value distribution whose mean is close to the empirical value. We notice the large uncertainty in the top-right corner of the state space remains (and typically does not vanish if we run the algorithm for longer); we hypothesize this is mainly an artifact of the discontinuous reward function, which is smoothened out differently by each member of the ensemble of dynamics, such that epistemic uncertainty stays high. [...] We consider a modification of the N-room gridworld environment by Domingues et al. (2021) [...] We motivate the importance of learning a distribution of values with a simple environment, the Mountain Car (Sutton and Barto, 2018) with continuous action space, as implemented in the gym library (Brockman et al., 2016). [...] In order to evaluate EQR-SAC more broadly, we conduct an experiment in a subset of 16 continuous-control tasks from the Deep Mind Control Suite (Tunyasuvunakool et al., 2020). |
| Dataset Splits | No | The paper describes reinforcement learning experiments where data is collected through interaction with environments (e.g., Mountain Car, Deep Mind Control Suite). It refers to 'experience replay buffer D' and 'model dataset Dmodel' which are continuously populated during training, rather than using predefined, static training, validation, and test splits of a fixed dataset, which is typical for supervised learning tasks. |
| Hardware Specification | No | The paper mentions 'GPU parallelization' in Section 6.5, but does not provide specific details such as GPU models, CPU models, memory, or other hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software like the 'mbrl-lib Python library from Pineda et al. (2021)' and that the 'SAC base implementation follows the open-source repository https://github.com/pranz24/pytorch-soft-actor-critic', as well as the 'gym library (Brockman et al., 2016)'. However, it does not specify version numbers for these libraries, Python, PyTorch, or any other critical software dependencies. |
| Experiment Setup | Yes | Appendix D. Hyperparameters: Table 1: Hyperparameters for Deep Mind Control Suite. In red, we highlight the only deviations of the base hyperparameters across all environments and baselines. T # episodes: 250; E steps per episode: 10^3; Replay buffer D capacity: 10^5; Warm-up steps (under initial policy): 5 × 10^3; G # gradient steps: 10; Batch size: 256; Auto-tuning of entropy coefficient α: Yes; Target entropy: dim(A); Actor MLP network: 2 hidden layers, 128 neurons, Tanh activations; Critic MLP network: 2 hidden layers, 256 neurons, Tanh activations; Actor/Critic learning rate: 3 × 10^-4; Dynamics model n ensemble size: 5; F frequency of model training (# steps): 250; L # model rollouts per step: 400; k rollout length: 5; # Model updates to retain data: 10; Model buffer Dmodel capacity (EQR-SAC): L F k (n) = 5 × 10^6 (25 × 10^6); Model MLP network (quadruped): 4 layers, 200 (400) neurons, SiLU activations; Learning rate: 1 × 10^-3; Quantile network m # quantiles: 51; # (s', a') samples (EQR-SAC only): 25 |
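For quick reference, the flattened Table 1 excerpt in the Experiment Setup row can be transcribed into a plain config mapping. The key names below are my own illustrative shorthand, not identifiers from the authors' repository; only values quoted in that row are included.

```python
# DeepMind Control Suite hyperparameters as quoted from the paper's Table 1.
# Key names are hypothetical shorthand, not taken from the authors' code.
EQR_SAC_HPARAMS = {
    "episodes": 250,                    # T
    "steps_per_episode": 1_000,         # E
    "replay_buffer_capacity": 100_000,  # buffer D
    "warmup_steps": 5_000,              # collected under the initial policy
    "gradient_steps": 10,               # G, per environment step
    "batch_size": 256,
    "autotune_entropy_alpha": True,     # SAC entropy coefficient auto-tuning
    "actor_critic_lr": 3e-4,
    "ensemble_size": 5,                 # n dynamics models
    "model_train_frequency": 250,       # F, in environment steps
    "model_rollouts_per_step": 400,     # L
    "rollout_length": 5,                # k
    "model_updates_retained": 10,       # model-buffer retention window
    "model_lr": 1e-3,
    "num_quantiles": 51,                # m, quantile network heads
    "num_next_state_action_samples": 25 # (s', a') samples, EQR-SAC only
}
```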
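The method's quantile network (m = 51 quantiles, per the Experiment Setup row) is trained by quantile regression. As background, here is a minimal NumPy sketch of the standard quantile-regression (pinball) loss with midpoint quantile fractions; this is not the authors' PyTorch implementation, which lives in the linked repository.

```python
import numpy as np

def pinball_loss(pred_quantiles, target, taus):
    """Quantile-regression (pinball) loss rho_tau(u) = u * (tau - 1{u < 0}),
    averaged over quantile fractions.

    pred_quantiles, taus: arrays of shape (m,); target: scalar return sample.
    """
    u = target - pred_quantiles                        # residual per quantile
    return np.mean(u * (taus - (u < 0).astype(float))) # asymmetric penalty

m = 51                            # number of quantiles, as in the paper's Table 1
taus = (np.arange(m) + 0.5) / m   # midpoint quantile fractions
pred = np.zeros(m)                # e.g. an untrained critic head
loss = pinball_loss(pred, target=1.0, taus=taus)  # → 0.5
```

Overestimating a return sample is penalized with weight (1 - tau) and underestimating with weight tau, which is what drives each head toward its own quantile of the return distribution.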