Online Apprenticeship Learning

Authors: Lior Shani, Tom Zahavy, Shie Mannor (pp. 8240-8248)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we implement a deep variant of our algorithm which shares some similarities to GAIL, but where the discriminator is replaced with the costs learned by OAL. Our simulations suggest that OAL performs well in high dimensional control problems.
Researcher Affiliation | Collaboration | Lior Shani (1), Tom Zahavy (2), Shie Mannor (1,3); (1) Technion - Israel Institute of Technology, Israel; (2) DeepMind, UK; (3) Nvidia Research, Israel
Pseudocode | Yes | Algorithm 1: OAL Scheme
Open Source Code | No | The paper refers to external open-source projects such as 'OpenAI Baselines' and 'Stable Baselines' in its references, which are tools used by the authors, but it does not provide a link to, or a statement about, the availability of its own source code for the methodology described in the paper.
Open Datasets | Yes | We evaluated deep OAL (Section 5) on the MuJoCo (Todorov, Erez, and Tassa 2012) set of continuous control tasks.
Dataset Splits | No | The paper states 'We used 10 expert trajectories in all our experiments' but does not specify any explicit training, validation, or test splits (e.g., percentages or sample counts) for this data or the environment interactions.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions using MDPO, TRPO, OpenAI Baselines, and Stable Baselines, but it does not specify version numbers for any of these software components to ensure reproducibility.
Experiment Setup | Yes | We used 10 expert trajectories in all our experiments, roughly the average amount in (Ho and Ermon 2016; Kostrikov et al. 2018). We tested OAL with both linear and neural costs (see Section 5), and compared them with GAIL. The same policy and cost networks were used for OAL and GAIL. Our theoretical analysis dictates to optimize the policy using stable updates. Thus, we used two policy optimization algorithms applying the MD update: (1) on-policy TRPO, which can be seen as a hard-constraint version of MDPO (Shani, Efroni, and Mannor 2020). (2) off-policy MDPO, which directly solves the policy updates in Eq. (3.2). ... On Lipschitz Costs. In Figure 3, we study the dependence on the Lipschitz regularization coefficient in the HalfCheetah-v3 domain.
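
The experiment-setup row above quotes the paper's alternating scheme: a cost player (linear or neural costs in place of GAIL's discriminator) against a policy player optimized with mirror-descent (MD) style updates (TRPO or MDPO). As a rough illustration of that loop, here is a minimal, self-contained sketch on a toy tabular problem with linear costs and a KL mirror-descent policy step. The toy dynamics, step sizes, norm projection, and the use of the immediate cost in place of a Q-estimate are all simplifying assumptions, not the authors' Algorithm 1.

```python
# Minimal sketch of an OAL-style alternating update loop (not the authors' code).
# Assumptions: a tiny random tabular MDP, linear costs c(s,a) = <w, phi(s,a)>, and a
# multiplicative-weights (KL mirror-descent) policy update. All hyperparameters are
# illustrative only.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, horizon = 5, 3, 20
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> next-state dist
phi = rng.normal(size=(n_states, n_actions, 4))                   # features for the linear cost

def occupancy(policy):
    """Truncated discounted state-action occupancy under a stationary policy."""
    gamma, d = 0.95, np.zeros((n_states, n_actions))
    s_dist = np.full(n_states, 1.0 / n_states)
    for t in range(horizon):
        sa = s_dist[:, None] * policy                  # state-action visitation at step t
        d += (gamma ** t) * sa
        s_dist = np.einsum("sa,san->n", sa, P)         # propagate through the dynamics
    return d / d.sum()

# "Expert" occupancy from a fixed reference policy (stands in for expert trajectories).
expert_policy = rng.dirichlet(np.ones(n_actions), size=n_states)
d_expert = occupancy(expert_policy)

policy = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform initial policy
w = np.zeros(phi.shape[-1])                               # linear cost parameters
eta_pi, eta_c = 0.5, 0.1

for k in range(200):
    d_pi = occupancy(policy)
    # Cost player: online gradient ascent on the learner-vs-expert feature gap,
    # with a norm projection playing the role of a boundedness/Lipschitz constraint.
    grad_w = np.einsum("sa,saf->f", d_pi - d_expert, phi)
    w = w + eta_c * grad_w
    w = w / max(1.0, np.linalg.norm(w))
    # Policy player: KL mirror-descent step against the current cost.
    # (Simplified: uses the immediate cost where a Q-value estimate would normally appear.)
    cost = phi @ w                                          # c(s, a) = <w, phi(s, a)>
    policy = policy * np.exp(-eta_pi * cost)
    policy = policy / policy.sum(axis=1, keepdims=True)

print("final feature-expectation gap:",
      np.linalg.norm(np.einsum("sa,saf->f", occupancy(policy) - d_expert, phi)))
```

In this simplified picture, the learned cost plays the role GAIL assigns to its discriminator: it is pushed to separate the learner's occupancy from the expert's, while the policy is updated by a stable, KL-regularized step against it.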
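The same row also references a Lipschitz regularization coefficient for the cost, studied on HalfCheetah-v3. The quoted text does not say how the constraint is enforced, so the sketch below is only an assumption modeled on a common gradient-penalty approach for neural costs; `CostNet`, `lipschitz_penalty`, the coefficient value, and the dimensions are hypothetical and used purely for illustration.

```python
# Illustrative only: one common way to encourage an (approximately) 1-Lipschitz neural
# cost is a gradient penalty weighted by a coefficient. Whether this matches the paper's
# exact regularizer is not stated in the quoted text.
import torch
import torch.nn as nn

class CostNet(nn.Module):
    """Hypothetical cost network over (state, action) pairs."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def lipschitz_penalty(cost_net, agent_sa, expert_sa, coeff):
    """Penalize cost gradients whose norm exceeds 1 on interpolated state-actions."""
    alpha = torch.rand(agent_sa[0].shape[0], 1)
    interp = [(alpha * a + (1 - alpha) * e).requires_grad_(True)
              for a, e in zip(agent_sa, expert_sa)]
    c = cost_net(*interp)
    grads = torch.autograd.grad(c.sum(), interp, create_graph=True)
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(dim=1)
    return coeff * ((grad_norm - 1).clamp(min=0) ** 2).mean()

if __name__ == "__main__":
    obs_dim, act_dim, batch = 17, 6, 32      # HalfCheetah-v3-like dimensions (assumed)
    cost_net = CostNet(obs_dim, act_dim)
    opt = torch.optim.Adam(cost_net.parameters(), lr=3e-4)
    agent_sa = (torch.randn(batch, obs_dim), torch.randn(batch, act_dim))
    expert_sa = (torch.randn(batch, obs_dim), torch.randn(batch, act_dim))
    # Cost player: raise the agent's cost, lower the expert's, keep the cost smooth.
    loss = (cost_net(*expert_sa).mean() - cost_net(*agent_sa).mean()
            + lipschitz_penalty(cost_net, agent_sa, expert_sa, coeff=10.0))
    opt.zero_grad(); loss.backward(); opt.step()
    print("cost-player loss:", float(loss))
```

The coefficient passed to `lipschitz_penalty` would be the kind of knob whose effect Figure 3 of the paper studies; its value here is arbitrary.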