MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning from Observations

Authors: Anqi Li, Byron Boots, Ching-An Cheng

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5. Experiments. We aim to answer the following questions: (a) Is MAHALO effective in solving different instances of offline PLfO problems? (b) Can MAHALO achieve similar performance as other specialized algorithms? (c) Whether MAHALO can obtain comparable performance to oracle algorithms with full reward and dynamics information? (d) In what situation is the pessimistic reward function of MAHALO critical in achieving good performance?
Researcher Affiliation | Collaboration | Anqi Li (1), Byron Boots (1), Ching-An Cheng (2); (1) University of Washington, (2) Microsoft Research.
Pseudocode | Yes | Algorithm 1: MAHALO (realized by ATAC). (A simplified training-loop sketch is given below the table.)
Open Source Code | Yes | Our code is available at https://github.com/AnqiLi/mahalo.
Open Datasets | Yes | We consider three environments from D4RL (Fu et al., 2020): hopper-v2, walker2d-v2, and halfcheetah-v2. ... We additionally evaluate MAHALO and other baseline algorithms on five robot manipulation tasks from Meta-World (Yu et al., 2020a). (A data-loading sketch is given below the table.)
Dataset Splits | No | The paper describes how the mixed-quality dynamics and expert datasets were constructed for training, but it does not give explicit train/validation/test splits (as percentages or sample counts). Evaluation is instead done by running the learned policy for a number of online trials in the environment (see the evaluation sketch below the table).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer and adapting code from lightATAC, but it does not give version numbers for these or for other software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We use a fixed learning rate across all algorithms and experiments. Same as Cheng et al. (2022), we use η_slow = 5×10⁻⁷ for policy updates and η_fast = 5×10⁻⁴ for updating the critic (and reward for MAHALO). We use a common discount factor of γ = 0.99 for all experiments. For target update, we use τ = 0.005, same as Cheng et al. (2022); Haarnoja et al. (2018). We use a fixed batch size of 256. ... For all algorithms based on ATAC, we warm-start training with 1) behavior cloning the behavioral policy and 2) learning a critic function to match the value of the behavioral policy. For MAHALO, we additionally train the reward function in this phase. (These hyperparameters appear in the training-loop sketch below the table.)
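
The pseudocode itself is not reproduced on this page. For orientation only, the following is a minimal PyTorch sketch of the kind of two-timescale adversarial actor-critic update that the Pseudocode and Experiment Setup rows describe, using the quoted hyperparameters (Adam with η_slow = 5×10⁻⁷ for the policy, η_fast = 5×10⁻⁴ for the critic and reward, γ = 0.99, τ = 0.005, batch size 256). The loss is a simplified ATAC-style pessimistic objective, not the paper's exact Algorithm 1, and the network sizes, the deterministic policy, and the BETA weight are illustrative assumptions.

```python
# Minimal sketch of a two-timescale ATAC-style update with the hyperparameters
# quoted in the Experiment Setup row. The objective below is a simplified
# stand-in for MAHALO's Algorithm 1, not the authors' implementation; network
# sizes, the deterministic policy, and BETA are illustrative assumptions.
import copy

import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 11, 3           # e.g. hopper-v2 dimensions (illustrative)
GAMMA, TAU, BATCH = 0.99, 0.005, 256
ETA_SLOW, ETA_FAST = 5e-7, 5e-4    # policy vs. critic-and-reward learning rates
BETA = 1.0                         # Bellman-error weight (illustrative)


def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out))


policy = mlp(OBS_DIM, ACT_DIM)             # deterministic policy for brevity
critic = mlp(OBS_DIM + ACT_DIM, 1)
reward = mlp(OBS_DIM + ACT_DIM, 1)         # MAHALO also learns a reward model
critic_target = copy.deepcopy(critic)

# The warm start (behavior cloning + critic fitting) would precede this loop.
policy_opt = torch.optim.Adam(policy.parameters(), lr=ETA_SLOW)
critic_opt = torch.optim.Adam(
    list(critic.parameters()) + list(reward.parameters()), lr=ETA_FAST)


def polyak_update(net, target, tau=TAU):
    with torch.no_grad():
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)


def update(obs, act, next_obs):
    """One adversarial update on a batch of (possibly reward-free) transitions."""
    # Fast timescale: pessimism term plus Bellman consistency for critic/reward.
    sa = torch.cat([obs, act], dim=-1)
    with torch.no_grad():
        a_pi = policy(obs)
        q_next = critic_target(torch.cat([next_obs, policy(next_obs)], dim=-1))
    q_data = critic(sa)
    q_pi = critic(torch.cat([obs, a_pi], dim=-1))
    bellman = ((q_data - (reward(sa) + GAMMA * q_next)) ** 2).mean()
    critic_loss = (q_pi - q_data).mean() + BETA * bellman
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Slow timescale: the policy maximizes the pessimistic critic.
    policy_loss = -critic(torch.cat([obs, policy(obs)], dim=-1)).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    polyak_update(critic, critic_target)


# Smoke test on random tensors standing in for an offline batch of size 256.
update(torch.randn(BATCH, OBS_DIM), torch.randn(BATCH, ACT_DIM),
       torch.randn(BATCH, OBS_DIM))
```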
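
The D4RL data referenced in the Open Datasets row can be loaded with the standard d4rl API. The sketch below is illustrative: the specific dataset ID is an assumption, since the paper constructs its own mixed-quality dynamics data and observation-only expert data from D4RL (and uses Meta-World tasks separately).

```python
# Loading a D4RL dataset (the dataset ID is an illustrative assumption; the
# paper builds its own mixed-quality dynamics data and observation-only expert
# data from D4RL, and evaluates Meta-World tasks separately).
import gym
import d4rl  # importing d4rl registers its environments with gym

env = gym.make("hopper-medium-v2")
data = d4rl.qlearning_dataset(env)  # observations, actions, rewards, terminals
print({k: v.shape for k, v in data.items()})

# An observation-only "expert" view, as used for imitation from observations:
expert_obs = data["observations"]
expert_next_obs = data["next_observations"]
```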
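
Because evaluation is reported as online rollouts rather than a held-out split of the offline data, a typical evaluation loop looks like the sketch below. The trial count, seed handling, environment ID, and policy interface are assumptions rather than values taken from the paper; the loop uses the pre-0.26 gym API that D4RL targets.

```python
# Evaluating a learned policy by online rollouts (no test split of the offline
# data). Trial count, seed, environment ID, and policy interface are assumptions.
import gym
import numpy as np


def evaluate(policy_fn, env_name="Hopper-v2", n_trials=10, seed=0):
    """Average undiscounted return of `policy_fn` over `n_trials` episodes."""
    env = gym.make(env_name)
    env.seed(seed)
    returns = []
    for _ in range(n_trials):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, rew, done, _ = env.step(policy_fn(obs))
            total += rew
        returns.append(total)
    return float(np.mean(returns))


# Example with a random policy standing in for the learned one.
sample_env = gym.make("Hopper-v2")
print(evaluate(lambda obs: sample_env.action_space.sample()))
```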