Hybrid Reinforcement Learning from Offline Observation Alone

Authors: Yuda Song, Drew Bagnell, Aarti Singh

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We also perform proof-of-concept experiments that suggest the effectiveness of our algorithm in practice." and "Empirical evaluation. We perform experiments to show the effectiveness of our algorithm on two challenging benchmarks: the rich-observation combination lock (Misra et al., 2020) and high-dimensional robotics manipulation tasks (Rajeswaran et al., 2017). We compare with the state-of-the-art hybrid RL algorithms and investigate the gap due to the more limited information in the offline dataset."
Researcher Affiliation | Collaboration | Yuda Song (Carnegie Mellon University), J. Andrew Bagnell (Carnegie Mellon University; Aurora Innovation), Aarti Singh (Carnegie Mellon University).
Pseudocode | Yes | Algorithm 1: FOrward Observation-matching BAckward Reinforcement learning (FOOBAR); Algorithm 2: Policy Search by Dynamic Programming (PSDP); Algorithm 3: Policy Search by Dynamic Programming (PSDP) with trace model; Algorithm 4: Forward Adversarial Imitation Learning (FAIL); Algorithm 5: Min-Max Game; Algorithm 6: Conservative Policy Iteration (CPI) with trace model; Algorithm 7: Interactive Forward Adversarial Imitation Learning (Inter-FAIL).
Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | "We use the following two benchmarks: the rich-observation combination lock (Misra et al., 2020) and high-dimensional robotics manipulation tasks (Rajeswaran et al., 2017). The visualization can be found in Figure 4. Both environments are challenging: ... Hammer. ... For the offline dataset construction of the hammer environment, we use the expert offline dataset provided in the d4rl benchmark."
Dataset Splits | No | The paper mentions sample sizes used for data collection and training (e.g., "We collect 2000 samples per horizon"), but does not provide specific train/validation/test dataset splits (percentages or counts) or a reference to standard splits for reproduction purposes.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU/CPU models, processor types, or memory specifications.
Software Dependencies | No | The paper mentions software components such as SAC and MMD with an RBF kernel, but does not provide specific version numbers for any of the software dependencies used in the experiments. (A minimal sketch of the MMD statistic follows the table.)
Experiment Setup | Yes | "Hyperparameters for the combination lock experiment are presented in Table 3." and "We provide the hyperparameter table in Table 4." These tables include specific values such as Offline sample size (per horizon) 2000, Min-max game iteration 1000, and Learning rate 0.001. (An illustrative configuration snippet follows the table.)
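
As context for the Software Dependencies entry above: the paper names MMD with an RBF kernel as one of its components, and that statistic can be computed with standard numerical libraries. The sketch below is a minimal NumPy version of the generic quantity, not the authors' implementation; the function and variable names, batch shapes, and bandwidth value are all illustrative assumptions.

import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    # Pairwise k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) between rows of a and b.
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    # Biased estimator of MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)].
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

# Illustrative usage: compare a batch of rollout observations to offline observations.
rollout_obs = np.random.randn(128, 16)
offline_obs = np.random.randn(128, 16)
print(mmd_squared(rollout_obs, offline_obs))

The RBF bandwidth is a tunable quantity the paper does not pin to a library version, which is consistent with the "No" result for this row.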
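
For the Experiment Setup entry, the specific values quoted from Tables 3 and 4 could be collected in a configuration object as below; only the numeric values come from the paper, and the key names are hypothetical.

# Hypothetical configuration sketch; values quoted from the paper's hyperparameter tables.
combination_lock_config = {
    "offline_sample_size_per_horizon": 2000,  # "Offline sample size (per horizon) 2000"
    "min_max_game_iterations": 1000,          # "Min-max game iteration 1000"
    "learning_rate": 0.001,                   # "Learning rate 0.001"
}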