Massively Scalable Inverse Reinforcement Learning in Google Maps
Authors: Matt Barnes, Matthew Abueg, Oliver F. Lange, Matt Deeds, Jason Trader, Denali Molitor, Markus Wulfmeier, Shawn O'Banion
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions culminate in a policy that achieves a 16-24% improvement in route quality at a global scale, and to the best of our knowledge, represents the largest published study of IRL algorithms in a real-world setting to date. We conclude by conducting an ablation study of key components, presenting negative results from alternative eigenvalue solvers, and identifying opportunities to further improve scalability via IRL-specific batching strategies. We train the final global policy for 1.4 GPU-years on a large cluster of V100 machines, which results in a significant 15.9% and 24.1% increase in route accuracy relative to the ETA+penalties baseline models for driving and two-wheelers, respectively. |
| Researcher Affiliation | Industry | Matt Barnes¹ Matthew Abueg¹ Oliver F. Lange² Matt Deeds² Jason Trader² Denali Molitor¹ Markus Wulfmeier³ Shawn O'Banion¹; ¹Google Research, ²Google Maps, ³Google DeepMind |
| Pseudocode | Yes | Algorithm 1: RHIP (Receding Horizon Inverse Planning); input: current reward r_θ, horizon H, demonstration τ with origin s_o and destination s_d; output: parameter update. Algorithm 2: MaxEnt++ and MaxEnt; Algorithm 3: Bayesian IRL; Algorithm 4: Max Margin Planning (MMP); each with input: reward r_θ and demonstration τ with origin s_o and destination s_d; output: parameter update. (A generic MaxEnt IRL gradient sketch is given after the table.) |
| Open Source Code | No | We strongly support open-sourcing experiments when legally and ethically permissible. The routing dataset used in this paper and likely any routing dataset of similar scope cannot be publicly released due to strict user privacy requirements. |
| Open Datasets | No | The routing dataset used in this paper and likely any routing dataset of similar scope cannot be publicly released due to strict user privacy requirements. Demonstration dataset: Dataset D contains de-identified users' trips collected during active navigation mode (Google, 2024). |
| Dataset Splits | Yes | The dataset is a fixed-size subsample of these routes, spanning a period of two weeks and evenly split into training and evaluation sets based on date. Separate datasets are created for driving and two-wheelers, with the two-wheeler (e.g. mopeds, scooters) dataset being significantly smaller than the drive dataset due to a smaller region where this feature is available. The total numbers of iterated training and validation demonstration routes are 110M and 10M, respectively. (A minimal date-based split sketch is given after the table.) |
| Hardware Specification | Yes | We train the final global policy for 1.4 GPU-years on a large cluster of V100 machines, which results in a significant 15.9% and 24.1% increase in route accuracy relative to the ETA+penalties baseline models for driving and two-wheelers, respectively. |
| Software Dependencies | No | The paper mentions 'ARPACK' (Lehoucq et al., 1998) and 'UMFPACK' (Davis, 2004) in the context of negative results and specific experiments in the appendix. However, it does not provide specific version numbers for any core software dependencies (like deep learning frameworks, libraries, or optimizers like SGD or Adam) that would be needed to replicate the main experimental setup. |
| Experiment Setup | Yes | Table 4 (Hyperparameters) reports: policy softmax temperature 10, 20, 30; horizon 2, 10, 100; reward model hidden layer width 18 and hidden layer depth 2; optimizer SGD learning rate 0.05, 0.01 and Adam learning rate 1e-5, 1e-6; batch size 8; epochs 200, 400. (A hedged config sketch is given after the table.) |
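
For the Pseudocode row, the paper's Algorithms 1-4 are quoted above only at the header level. As a point of reference, the following is a minimal sketch of the generic tabular MaxEnt IRL gradient step (a backward soft value iteration to obtain a stochastic policy, a forward visitation pass, and a gradient equal to demonstration feature counts minus expected feature counts) that these algorithms build on. It is not the paper's RHIP procedure; the toy MDP, all sizes, and all names are hypothetical.

```python
# Minimal sketch of one MaxEnt IRL gradient step on a small tabular MDP.
# NOT the paper's RHIP algorithm; it only illustrates the generic backward
# pass -> stochastic policy -> forward pass -> gradient structure.
import numpy as np

n_states, n_actions, n_features, horizon = 5, 2, 3, 10
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probs (s, a, s')
phi = rng.normal(size=(n_states, n_features))                      # state features
theta = np.zeros(n_features)                                        # reward weights
demo = [0, 1, 2, 3]                                                  # one demonstrated state path

def maxent_gradient(theta):
    r = phi @ theta                                # per-state reward
    # Backward pass: finite-horizon soft value iteration toward the demo goal.
    v = np.full(n_states, -1e9)
    v[demo[-1]] = 0.0
    for _ in range(horizon):
        q = r[:, None] + np.einsum("sat,t->sa", P, v)
        v = np.logaddexp.reduce(q, axis=1)
        v[demo[-1]] = 0.0
    policy = np.exp(q - v[:, None])                # stochastic policy pi(a|s)
    # Forward pass: expected state visitations starting from the demo origin.
    d = np.zeros(n_states)
    d[demo[0]] = 1.0
    visits = np.zeros(n_states)
    for _ in range(horizon):
        visits += d
        d = np.einsum("s,sa,sat->t", d, policy, P)
    # Gradient of the demonstration log-likelihood w.r.t. theta:
    # demonstrated feature counts minus expected feature counts under the policy.
    return phi[demo].sum(axis=0) - visits @ phi

theta += 0.05 * maxent_gradient(theta)             # one SGD step (lr value from Table 4)
```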
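
For the Dataset Splits row, a minimal sketch of an even, date-based train/evaluation split over a two-week window. The column names, the midpoint cutoff rule, and the use of pandas are assumptions; the paper only states that routes are split evenly into training and evaluation sets based on date.

```python
# Hypothetical date-based split over a two-week window of demonstration routes.
import pandas as pd

routes = pd.DataFrame({
    "route_id": [1, 2, 3, 4],
    "trip_date": pd.to_datetime(["2023-05-01", "2023-05-04", "2023-05-09", "2023-05-12"]),
})

# Assumed cutoff: the midpoint of the two-week window.
cutoff = routes["trip_date"].min() + pd.Timedelta(days=7)
train_routes = routes[routes["trip_date"] < cutoff]
eval_routes = routes[routes["trip_date"] >= cutoff]
```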
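
For the Experiment Setup row, one way a replication attempt might organize the Table 4 values. Collecting them into a single config dict, and the grouping shown, are assumptions; the paper reports these values as candidate sets without tying them to specific runs.

```python
# Hedged organization of the Table 4 hyperparameters for a replication attempt.
# Lists hold the candidate values reported in the paper.
config = {
    "policy": {
        "softmax_temperature": [10, 20, 30],
        "horizon": [2, 10, 100],
    },
    "reward_model": {
        "hidden_layer_width": 18,
        "hidden_layer_depth": 2,
    },
    "optimizer": {
        "sgd_learning_rate": [0.05, 0.01],
        "adam_learning_rate": [1e-5, 1e-6],
    },
    "training": {
        "batch_size": 8,
        "epochs": [200, 400],
    },
}
```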