Online Apprenticeship Learning
Authors: Lior Shani, Tom Zahavy, Shie Mannor (pp. 8240-8248)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we implement a deep variant of our algorithm which shares some similarities to GAIL, but where the discriminator is replaced with the costs learned by OAL. Our simulations suggest that OAL performs well in high dimensional control problems. |
| Researcher Affiliation | Collaboration | Lior Shani (1), Tom Zahavy (2), Shie Mannor (1,3); (1) Technion - Israel Institute of Technology, Israel; (2) DeepMind, UK; (3) Nvidia Research, Israel |
| Pseudocode | Yes | Algorithm 1: OAL Scheme |
| Open Source Code | No | The paper refers to external open-source projects such as 'OpenAI Baselines' and 'Stable Baselines' in its references, which are tools used by the authors, but it does not provide a link to, or a statement about the availability of, its own source code for the described methodology. |
| Open Datasets | Yes | We evaluated deep OAL (Section 5) on the MuJoCo (Todorov, Erez, and Tassa 2012) set of continuous control tasks. |
| Dataset Splits | No | The paper states 'We used 10 expert trajectories in all our experiments' but does not specify any explicit training, validation, or test splits (e.g., percentages or sample counts) for this data or the environment interactions. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions using MDPO, TRPO, OpenAI Baselines, and Stable Baselines, but it does not specify version numbers for any of these software components to ensure reproducibility. |
| Experiment Setup | Yes | We used 10 expert trajectories in all our experiments, roughly the average amount in (Ho and Ermon 2016; Kostrikov et al. 2018). We tested OAL with both linear and neural costs (see Section 5), and compared them with GAIL. The same policy and cost networks were used for OAL and GAIL. Our theoretical analysis dictates to optimize the policy using stable updates. Thus, we used two policy optimization algorithms applying the MD update: (1) on-policy TRPO, which can be seen as a hard-constraint version of MDPO (Shani, Efroni, and Mannor 2020). (2) off-policy MDPO, which directly solves the policy updates in Eq. (3.2). ... On Lipschitz Costs. In Figure 3, we study the dependence on the Lipschitz regularization coefficient in the HalfCheetah-v3 domain. (An illustrative sketch of this alternating cost/policy loop is given below the table.) |
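The quoted setup describes an alternating scheme: the costs are updated online (with a Lipschitz regularization on the cost function), and the policy is updated against the current costs with a mirror-descent (MD) style step (TRPO or MDPO). The snippet below is a minimal, hypothetical sketch of that alternating structure on a toy tabular problem with linear costs; it is not the authors' Algorithm 1 or their released code, and all names and hyperparameters (`phi`, `eta_c`, `eta_pi`, `W`, the synthetic expert visits) are illustrative assumptions.

```python
# Hedged sketch (not the paper's code): the alternating structure suggested by
# "Algorithm 1: OAL Scheme" -- at every round the cost player takes an online
# gradient step on linear costs (projected to keep them bounded/Lipschitz) and
# the policy player takes a mirror-descent (exponentiated-gradient) step
# against the current costs. Toy tabular setting with synthetic features.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, d = 5, 3, 8                   # toy problem sizes, feature dim
phi = rng.normal(size=(n_states, n_actions, d))    # fixed features phi(s, a)
expert_sa = [(s, int(rng.integers(n_actions))) for s in range(n_states)]  # fake "expert" visits

w = np.zeros(d)                                    # linear cost parameters: c_w(s, a) = w . phi(s, a)
policy = np.full((n_states, n_actions), 1.0 / n_actions)  # stochastic policy pi(a|s)

eta_c, eta_pi, W = 0.1, 0.5, 1.0                   # step sizes and cost-norm bound (assumed values)

for t in range(200):
    # --- cost player: online gradient ascent on E_pi[c_w] - E_expert[c_w] ---
    agent_feat = np.einsum("sa,sad->d", policy, phi) / n_states   # E_pi[phi] under uniform states
    expert_feat = np.mean([phi[s, a] for s, a in expert_sa], axis=0)
    w += eta_c * (agent_feat - expert_feat)
    w *= min(1.0, W / (np.linalg.norm(w) + 1e-12))                # project onto a norm ball

    # --- policy player: mirror-descent (exponentiated-gradient) step on current costs ---
    costs = phi @ w                                               # c_w(s, a) for all (s, a)
    policy *= np.exp(-eta_pi * costs)
    policy /= policy.sum(axis=1, keepdims=True)

# after training, the policy should put more mass on the (fake) expert actions
print("cost-weight norm:", np.linalg.norm(w))
print("avg prob of expert action:", np.mean([policy[s, a] for s, a in expert_sa]))
```

In the deep variant quoted in the Research Type row, the linear costs would be replaced by a cost network that plays the role GAIL assigns to its discriminator, with the policy still optimized by TRPO or MDPO as quoted in the Experiment Setup row.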