Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning
Authors: Yue Wu, Shuangfei Zhai, Nitish Srivastava, Joshua M Susskind, Jian Zhang, Ruslan Salakhutdinov, Hanlin Goh
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we observe that UWAC substantially improves model stability during training. In addition, UWAC outperforms existing offline RL methods on a variety of competitive tasks, and achieves significant performance gains over the state-of-the-art baseline on datasets with sparse demonstrations collected from human experts. Our experiments are structured as follows: In section 5.1, we validate and visualize the effectiveness of dropout uncertainty estimation in RL (a minimal MC-dropout sketch follows the table). In section 5.2 we present competitive benchmarking results on the widely-used D4RL MuJoCo walkers dataset. We then experiment with the more complex Adroit hand manipulation environment in section 5.3, and analyze the training stability and the effectiveness against OOD samples by examining the Q target functions in section 5.4. We report the implementation details in section 5.5, ablation studies in section 5.6, and training time in A.2. |
| Researcher Affiliation | Collaboration | Apple Inc.; Carnegie Mellon University. |
| Pseudocode | Yes | Algorithm 1: Pseudocode for UWAC; differences from (Kumar et al., 2019) are colored (an uncertainty-weighted loss sketch follows the table). |
| Open Source Code | Yes | Code available at github.com/apple/ml-uwac |
| Open Datasets | Yes | We evaluate our method on the MuJoCo datasets in the D4RL benchmarks (Fu et al., 2020), including three environments (halfcheetah, hopper, and walker2d) and five dataset types (random, medium, medium-replay, medium-expert, expert), yielding a total of 15 problem settings. The Adroit dataset in the D4RL benchmarks (Rajeswaran et al., 2017) involves controlling a 24-DoF simulated hand to perform 4 tasks including hammering a nail, opening a door, twirling a pen, and picking/moving a ball. |
| Dataset Splits | No | The paper mentions using well-known benchmark datasets (D4RL MuJoCo, D4RL Adroit), which often have standard splits, but it does not explicitly provide the specific percentages, sample counts, or detailed methodology for splitting the datasets into training, validation, and test sets within the paper's text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions using existing libraries and environments such as 'OpenAI Gym LunarLander-v2' and refers to using the 'official GitHub code of BEAR', but it does not provide specific version numbers for any software dependencies such as Python, PyTorch, TensorFlow, or other libraries. |
| Experiment Setup | Yes | For the choice of β in Algorithm 1, we swept over values from the set {0.8, 1.6, 2.5}, determined by matching the average uncertainty output during training time. We ran a parameter search over all the recommended parameters: kernel type {gaussian, laplacian}, MMD sigma {10, 20}, 100 actions sampled for evaluation, and an MMD target threshold of 0.07. We keep the hyper-parameters and the network architecture exactly the same as in BEAR (a hyperparameter-grid sketch follows the table). |
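
The dropout uncertainty estimation referenced in the Research Type row can be illustrated with Monte-Carlo dropout over a Q-network. The sketch below is illustrative only: the `QNetwork` architecture, layer sizes, dropout rate, and `n_samples` value are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q(s, a) network with dropout layers (hypothetical sizes)."""
    def __init__(self, state_dim, action_dim, hidden=256, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def mc_dropout_q(q_net, state, action, n_samples=10):
    """Estimate the mean and variance of Q(s, a) via MC dropout.

    Dropout stays active (train mode), so each forward pass samples a
    different sub-network; the spread across samples approximates the
    epistemic uncertainty of the Q estimate.
    """
    q_net.train()  # keep dropout active at inference time
    with torch.no_grad():
        samples = torch.stack([q_net(state, action) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.var(dim=0)
```

High variance across the sampled sub-networks flags state-action pairs the network has rarely seen, which is exactly the out-of-distribution signal UWAC needs in the offline setting.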
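Algorithm 1 (Pseudocode row) weights the Bellman error by a factor inversely proportional to the target's uncertainty. A minimal sketch of such an uncertainty-weighted critic loss follows; the clipping bound `w_max`, the variance floor `1e-6`, and the exact normalization are assumptions rather than the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def uwac_critic_loss(q_pred, q_target, target_var, beta=1.6, w_max=1.5):
    """Down-weight Bellman errors on high-uncertainty (likely OOD) targets.

    q_pred:     Q(s, a) from the critic being trained
    q_target:   bootstrapped target r + gamma * Q'(s', a')
    target_var: MC-dropout variance of the target Q (see sketch above)
    beta:       uncertainty temperature, swept over {0.8, 1.6, 2.5} in the paper
    w_max:      clipping bound on the weight (assumed value)
    """
    # Inverse-variance weight: confident targets contribute fully,
    # uncertain (out-of-distribution) targets are suppressed.
    weight = torch.clamp(beta / (target_var + 1e-6), max=w_max).detach()
    per_sample = F.mse_loss(q_pred, q_target.detach(), reduction="none")
    return (weight * per_sample).mean()
```

Suppressing the gradient contribution of uncertain targets is what the paper credits for the improved training stability noted in the Research Type row.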
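The sweep described in the Experiment Setup row amounts to a small grid over BEAR-inherited hyperparameters plus UWAC's β. A hypothetical way to enumerate that grid (key names such as `mmd_sigma` are illustrative, not the repository's actual flags):

```python
from itertools import product

# Grid taken from the Experiment Setup row; key names are illustrative.
grid = {
    "beta": [0.8, 1.6, 2.5],            # UWAC uncertainty temperature
    "kernel_type": ["gaussian", "laplacian"],
    "mmd_sigma": [10, 20],
    "num_eval_actions": [100],          # actions sampled for evaluation
    "mmd_target_threshold": [0.07],
}

# Enumerate every configuration in the sweep (3 * 2 * 2 = 12 runs).
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    print(config)  # e.g. pass each config to a training run
```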