MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning from Observations

Authors: Anqi Li, Byron Boots, Ching-An Cheng

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5. Experiments. We aim to answer the following questions: (a) Is MAHALO effective in solving different instances of offline PLfO problems? (b) Can MAHALO achieve similar performance as other specialized algorithms? (c) Whether MAHALO can obtain comparable performance to oracle algorithms with full reward and dynamics information? (d) In what situation is the pessimistic reward function of MAHALO critical in achieving good performance?
Researcher Affiliation | Collaboration | Anqi Li (1), Byron Boots (1), Ching-An Cheng (2); (1) University of Washington, (2) Microsoft Research.
Pseudocode | Yes | Algorithm 1: MAHALO (realized by ATAC). (A simplified training-loop sketch is given below the table.)
Open Source Code | Yes | Our code is available at https://github.com/AnqiLi/mahalo.
Open Datasets | Yes | We consider three environments from D4RL (Fu et al., 2020): hopper-v2, walker2d-v2, and halfcheetah-v2. ... We additionally evaluate MAHALO and other baseline algorithms on five robot manipulation tasks from Meta-World (Yu et al., 2020a). (A data-loading sketch is given below the table.)
Dataset Splits | No | The paper describes how the mixed-quality dynamics and expert datasets were constructed for training, but it does not give explicit train/validation/test splits (as percentages or sample counts). Evaluation is instead done by running the learned policy for a number of online trials in the environment (see the evaluation sketch below the table).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer and adapting code from lightATAC, but it does not give version numbers for these or for other software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We use a fixed learning rate across all algorithms and experiments. Same as Cheng et al. (2022), we use η_slow = 5×10⁻⁷ for policy updates and η_fast = 5×10⁻⁴ for updating the critic (and reward for MAHALO). We use a common discount factor of γ = 0.99 for all experiments. For target update, we use τ = 0.005, same as Cheng et al. (2022); Haarnoja et al. (2018). We use a fixed batch size of 256. ... For all algorithms based on ATAC, we warm-start training with 1) behavior cloning the behavioral policy and 2) learning a critic function to match the value of the behavioral policy. For MAHALO, we additionally train the reward function in this phase. (These hyperparameters appear in the training-loop sketch below the table.)
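
The pseudocode itself is not reproduced on this page. For orientation only, the following is a minimal PyTorch sketch of the kind of two-timescale adversarial actor-critic update that the Pseudocode and Experiment Setup rows describe, using the quoted hyperparameters (Adam with η_slow = 5×10⁻⁷ for the policy, η_fast = 5×10⁻⁴ for the critic and reward, γ = 0.99, τ = 0.005, batch size 256). The loss is a simplified ATAC-style pessimistic objective, not the paper's exact Algorithm 1, and the network sizes, the deterministic policy, and the BETA weight are illustrative assumptions.

```python
# Minimal sketch of a two-timescale ATAC-style update with the hyperparameters
# quoted in the Experiment Setup row. The objective below is a simplified
# stand-in for MAHALO's Algorithm 1, not the authors' implementation; network
# sizes, the deterministic policy, and BETA are illustrative assumptions.
import copy

import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 11, 3           # e.g. hopper-v2 dimensions (illustrative)
GAMMA, TAU, BATCH = 0.99, 0.005, 256
ETA_SLOW, ETA_FAST = 5e-7, 5e-4    # policy vs. critic-and-reward learning rates
BETA = 1.0                         # Bellman-error weight (illustrative)


def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out))


policy = mlp(OBS_DIM, ACT_DIM)             # deterministic policy for brevity
critic = mlp(OBS_DIM + ACT_DIM, 1)
reward = mlp(OBS_DIM + ACT_DIM, 1)         # MAHALO also learns a reward model
critic_target = copy.deepcopy(critic)

# The warm start (behavior cloning + critic fitting) would precede this loop.
policy_opt = torch.optim.Adam(policy.parameters(), lr=ETA_SLOW)
critic_opt = torch.optim.Adam(
    list(critic.parameters()) + list(reward.parameters()), lr=ETA_FAST)


def polyak_update(net, target, tau=TAU):
    with torch.no_grad():
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)


def update(obs, act, next_obs):
    """One adversarial update on a batch of (possibly reward-free) transitions."""
    # Fast timescale: pessimism term plus Bellman consistency for critic/reward.
    sa = torch.cat([obs, act], dim=-1)
    with torch.no_grad():
        a_pi = policy(obs)
        q_next = critic_target(torch.cat([next_obs, policy(next_obs)], dim=-1))
    q_data = critic(sa)
    q_pi = critic(torch.cat([obs, a_pi], dim=-1))
    bellman = ((q_data - (reward(sa) + GAMMA * q_next)) ** 2).mean()
    critic_loss = (q_pi - q_data).mean() + BETA * bellman
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Slow timescale: the policy maximizes the pessimistic critic.
    policy_loss = -critic(torch.cat([obs, policy(obs)], dim=-1)).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    polyak_update(critic, critic_target)


# Smoke test on random tensors standing in for an offline batch of size 256.
update(torch.randn(BATCH, OBS_DIM), torch.randn(BATCH, ACT_DIM),
       torch.randn(BATCH, OBS_DIM))
```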
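
The D4RL data referenced in the Open Datasets row can be loaded with the standard d4rl API. The sketch below is illustrative: the specific dataset ID is an assumption, since the paper constructs its own mixed-quality dynamics data and observation-only expert data from D4RL (and uses Meta-World tasks separately).

```python
# Loading a D4RL dataset (the dataset ID is an illustrative assumption; the
# paper builds its own mixed-quality dynamics data and observation-only expert
# data from D4RL, and evaluates Meta-World tasks separately).
import gym
import d4rl  # importing d4rl registers its environments with gym

env = gym.make("hopper-medium-v2")
data = d4rl.qlearning_dataset(env)  # observations, actions, rewards, terminals
print({k: v.shape for k, v in data.items()})

# An observation-only "expert" view, as used for imitation from observations:
expert_obs = data["observations"]
expert_next_obs = data["next_observations"]
```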
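
Because evaluation is reported as online rollouts rather than a held-out split of the offline data, a typical evaluation loop looks like the sketch below. The trial count, seed handling, environment ID, and policy interface are assumptions rather than values taken from the paper; the loop uses the pre-0.26 gym API that D4RL targets.

```python
# Evaluating a learned policy by online rollouts (no test split of the offline
# data). Trial count, seed, environment ID, and policy interface are assumptions.
import gym
import numpy as np


def evaluate(policy_fn, env_name="Hopper-v2", n_trials=10, seed=0):
    """Average undiscounted return of `policy_fn` over `n_trials` episodes."""
    env = gym.make(env_name)
    env.seed(seed)
    returns = []
    for _ in range(n_trials):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, rew, done, _ = env.step(policy_fn(obs))
            total += rew
        returns.append(total)
    return float(np.mean(returns))


# Example with a random policy standing in for the learned one.
sample_env = gym.make("Hopper-v2")
print(evaluate(lambda obs: sample_env.action_space.sample()))
```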