Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning
Authors: Yue Wu, Shuangfei Zhai, Nitish Srivastava, Joshua M Susskind, Jian Zhang, Ruslan Salakhutdinov, Hanlin Goh
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we observe that UWAC substantially improves model stability during training. In addition, UWAC outperforms existing offline RL methods on a variety of competitive tasks, and achieves significant performance gains over the state-of-the-art baseline on datasets with sparse demonstrations collected from human experts. Our experiments are structured as follows: In section 5.1, we validate and visualize the effectiveness of dropout uncertainty estimation in RL (a minimal MC-dropout sketch follows the table). In section 5.2 we present competitive benchmarking results on the widely-used D4RL MuJoCo walkers dataset. We then experiment with the more complex Adroit hand manipulation environment in section 5.3, and analyze the training stability and the effectiveness against OOD samples by examining the Q target functions in section 5.4. We report the implementation details in section 5.5, ablation studies in section 5.6, and training time in A.2. |
| Researcher Affiliation | Collaboration | Apple Inc.; Carnegie Mellon University. |
| Pseudocode | Yes | Algorithm 1: Pseudocode for UWAC; differences from (Kumar et al., 2019) are colored (an uncertainty-weighted loss sketch follows the table). |
| Open Source Code | Yes | Code available at github.com/apple/ml-uwac |
| Open Datasets | Yes | We evaluate our method on the MuJoCo datasets in the D4RL benchmarks (Fu et al., 2020), including three environments (halfcheetah, hopper, and walker2d) and five dataset types (random, medium, medium-replay, medium-expert, expert), yielding a total of 15 problem settings. The Adroit dataset in the D4RL benchmarks (Rajeswaran et al., 2017) involves controlling a 24-DoF simulated hand to perform 4 tasks including hammering a nail, opening a door, twirling a pen, and picking/moving a ball. |
| Dataset Splits | No | The paper mentions using well-known benchmark datasets (D4RL MuJoCo, D4RL Adroit), which often have standard splits, but it does not explicitly provide the specific percentages, sample counts, or detailed methodology for splitting the datasets into training, validation, and test sets within the paper's text. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions using existing libraries and environments such as 'OpenAI Gym LunarLander-v2' and refers to using the 'official GitHub code of BEAR', but it does not provide specific version numbers for any software dependencies such as Python, PyTorch, TensorFlow, or other libraries. |
| Experiment Setup | Yes | For the choice of β in Algorithm 1, we swept over values from the set {0.8, 1.6, 2.5}, determined by matching the average uncertainty output during training time. We ran a parameter search over all the recommended parameters: kernel type {gaussian, laplacian}, MMD sigma {10, 20}, 100 actions sampled for evaluation, and an MMD target threshold of 0.07. We keep the hyper-parameters and the network architecture exactly the same as in BEAR (a hyperparameter-grid sketch follows the table). |
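
The dropout uncertainty estimation referenced in the Research Type row can be illustrated with Monte-Carlo dropout over a Q-network. The sketch below is illustrative only: the `QNetwork` architecture, layer sizes, dropout rate, and `n_samples` value are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q(s, a) network with dropout layers (hypothetical sizes)."""
    def __init__(self, state_dim, action_dim, hidden=256, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def mc_dropout_q(q_net, state, action, n_samples=10):
    """Estimate the mean and variance of Q(s, a) via MC dropout.

    Dropout stays active (train mode), so each forward pass samples a
    different sub-network; the spread across samples approximates the
    epistemic uncertainty of the Q estimate.
    """
    q_net.train()  # keep dropout active at inference time
    with torch.no_grad():
        samples = torch.stack([q_net(state, action) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.var(dim=0)
```

High variance across the sampled sub-networks flags state-action pairs the network has rarely seen, which is exactly the out-of-distribution signal UWAC needs in the offline setting.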
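Algorithm 1 (Pseudocode row) weights the Bellman error by a factor inversely proportional to the target's uncertainty. A minimal sketch of such an uncertainty-weighted critic loss follows; the clipping bound `w_max`, the variance floor `1e-6`, and the exact normalization are assumptions rather than the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def uwac_critic_loss(q_pred, q_target, target_var, beta=1.6, w_max=1.5):
    """Down-weight Bellman errors on high-uncertainty (likely OOD) targets.

    q_pred:     Q(s, a) from the critic being trained
    q_target:   bootstrapped target r + gamma * Q'(s', a')
    target_var: MC-dropout variance of the target Q (see sketch above)
    beta:       uncertainty temperature, swept over {0.8, 1.6, 2.5} in the paper
    w_max:      clipping bound on the weight (assumed value)
    """
    # Inverse-variance weight: confident targets contribute fully,
    # uncertain (out-of-distribution) targets are suppressed.
    weight = torch.clamp(beta / (target_var + 1e-6), max=w_max).detach()
    per_sample = F.mse_loss(q_pred, q_target.detach(), reduction="none")
    return (weight * per_sample).mean()
```

Suppressing the gradient contribution of uncertain targets is what the paper credits for the improved training stability noted in the Research Type row.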
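The sweep described in the Experiment Setup row amounts to a small grid over BEAR-inherited hyperparameters plus UWAC's β. A hypothetical way to enumerate that grid (key names such as `mmd_sigma` are illustrative, not the repository's actual flags):

```python
from itertools import product

# Grid taken from the Experiment Setup row; key names are illustrative.
grid = {
    "beta": [0.8, 1.6, 2.5],            # UWAC uncertainty temperature
    "kernel_type": ["gaussian", "laplacian"],
    "mmd_sigma": [10, 20],
    "num_eval_actions": [100],          # actions sampled for evaluation
    "mmd_target_threshold": [0.07],
}

# Enumerate every configuration in the sweep (3 * 2 * 2 = 12 runs).
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    print(config)  # e.g. pass each config to a training run
```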