Hybrid Reinforcement Learning from Offline Observation Alone
Authors: Yuda Song, Drew Bagnell, Aarti Singh
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also perform proof-of-concept experiments that suggest the effectiveness of our algorithm in practice. Empirical evaluation: We perform experiments to show the effectiveness of our algorithm on two challenging benchmarks: the rich-observation combination lock (Misra et al., 2020) and high-dimensional robotics manipulation tasks (Rajeswaran et al., 2017). We compare with the state-of-the-art hybrid RL algorithms and investigate the gap due to the more limited information in the offline dataset. |
| Researcher Affiliation | Collaboration | Yuda Song 1 J. Andrew Bagnell 1 2 Aarti Singh 1 1Carnegie Mellon University 2Aurora Innovation. |
| Pseudocode | Yes | Algorithm 1 FOrward Observation-matching BAckward Reinforcement learning (FOOBAR), Algorithm 2 Policy Search by Dynamic Programming (PSDP), Algorithm 3 Policy Search by Dynamic Programming (PSDP) with trace model, Algorithm 4 Forward Adversarial Imitation Learning (FAIL), Algorithm 5 Min-Max Game, Algorithm 6 Conservative Policy Iteration (CPI) with trace model, Algorithm 7 Interactive Forward Adversarial Imitation Learning (Inter-FAIL). |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We use the following two benchmarks: the rich-observation combination lock (Misra et al., 2020) and high-dimensional robotics manipulation tasks (Rajeswaran et al., 2017). The visualization can be found in Figure 4. Both environments are challenging: ... Hammer. ... For the offline dataset construction of the hammer environment, we use the expert offline dataset provided in the d4rl benchmark. |
| Dataset Splits | No | The paper mentions sample sizes used for data collection and training (e.g., 'We collect 2000 samples per horizon'), but does not provide specific train/validation/test dataset splits (percentages or counts) or reference to standard splits for reproduction purposes. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU/CPU models, processor types, or memory specifications. |
| Software Dependencies | No | The paper mentions software components like SAC and MMD with RBF Kernel, but does not provide specific version numbers for any of the software dependencies used in the experiments. |
| Experiment Setup | Yes | Hyperparameters for the combination lock experiment are presented in Table 3, and the hyperparameter table for the manipulation tasks is provided in Table 4. These tables include specific values such as an offline sample size of 2000 per horizon, 1000 min-max game iterations, and a learning rate of 0.001. |
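
The Software Dependencies and Experiment Setup rows above mention an MMD objective with an RBF kernel and an offline sample size of 2000 per horizon, but no code is released. As a point of reference only, below is a minimal NumPy sketch of a squared-MMD estimate with an RBF kernel between an offline observation batch and an on-policy batch; the batch shapes, bandwidth, and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    # Gaussian/RBF kernel matrix between batches x of shape (n, d) and y of shape (m, d).
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    # Biased (V-statistic) estimate of squared MMD between samples x and y.
    return (
        rbf_kernel(x, x, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
        - 2.0 * rbf_kernel(x, y, bandwidth).mean()
    )

# Hypothetical usage: compare 2000 offline observations at one horizon step
# against 2000 observations from rollouts of the current policy.
rng = np.random.default_rng(0)
offline_obs = rng.normal(size=(2000, 10))  # placeholder offline observations
policy_obs = rng.normal(size=(2000, 10))   # placeholder on-policy observations
print(mmd_squared(offline_obs, policy_obs))
```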