Massively Scalable Inverse Reinforcement Learning in Google Maps
Authors: Matt Barnes, Matthew Abueg, Oliver F. Lange, Matt Deeds, Jason Trader, Denali Molitor, Markus Wulfmeier, Shawn O'Banion
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions culminate in a policy that achieves a 16-24% improvement in route quality at a global scale, and to the best of our knowledge, represents the largest published study of IRL algorithms in a real-world setting to date. We conclude by conducting an ablation study of key components, presenting negative results from alternative eigenvalue solvers, and identifying opportunities to further improve scalability via IRL-specific batching strategies. We train the final global policy for 1.4 GPU-years on a large cluster of V100 machines, which results in a significant 15.9% and 24.1% increase in route accuracy relative to the ETA+penalties baseline models for driving and two-wheelers, respectively. |
| Researcher Affiliation | Industry | Matt Barnes¹ Matthew Abueg¹ Oliver F. Lange² Matt Deeds² Jason Trader² Denali Molitor¹ Markus Wulfmeier³ Shawn O'Banion¹; ¹Google Research, ²Google Maps, ³Google DeepMind |
| Pseudocode | Yes | Algorithm 1: RHIP (Receding Horizon Inverse Planning); input: current reward r_θ, horizon H, demonstration τ with origin s_o and destination s_d; output: parameter update. Algorithm 2: MaxEnt++ and MaxEnt; Algorithm 3: Bayesian IRL; Algorithm 4: Max Margin Planning (MMP); each with input: reward r_θ and demonstration τ with origin s_o and destination s_d; output: parameter update. (A generic MaxEnt IRL gradient sketch is given after the table.) |
| Open Source Code | No | We strongly support open-sourcing experiments when legally and ethically permissible. The routing dataset used in this paper and likely any routing dataset of similar scope cannot be publicly released due to strict user privacy requirements. |
| Open Datasets | No | The routing dataset used in this paper and likely any routing dataset of similar scope cannot be publicly released due to strict user privacy requirements. Demonstration dataset: Dataset D contains de-identified users' trips collected during active navigation mode (Google, 2024). |
| Dataset Splits | Yes | The dataset is a fixed-size subsample of these routes, spanning a period of two weeks and evenly split into training and evaluation sets based on date. Separate datasets are created for driving and two-wheelers, with the two-wheeler (e.g. mopeds, scooters) dataset being significantly smaller than the drive dataset due to a smaller region where this feature is available. The total numbers of iterated training and validation demonstration routes are 110M and 10M, respectively. (A minimal date-based split sketch is given after the table.) |
| Hardware Specification | Yes | We train the final global policy for 1.4 GPU-years on a large cluster of V100 machines, which results in a significant 15.9% and 24.1% increase in route accuracy relative to the ETA+penalties baseline models for driving and two-wheelers, respectively. |
| Software Dependencies | No | The paper mentions 'ARPACK' (Lehoucq et al., 1998) and 'UMFPACK' (Davis, 2004) in the context of negative results and specific experiments in the appendix. However, it does not provide specific version numbers for any core software dependencies (like deep learning frameworks, libraries, or optimizers like SGD or Adam) that would be needed to replicate the main experimental setup. |
| Experiment Setup | Yes | Table 4 (Hyperparameters) reports: policy softmax temperature 10, 20, 30; horizon 2, 10, 100; reward model hidden layer width 18 and hidden layer depth 2; optimizer SGD learning rate 0.05, 0.01 and Adam learning rate 1e-5, 1e-6; batch size 8; epochs 200, 400. (A hedged config sketch is given after the table.) |
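
For the Pseudocode row, the paper's Algorithms 1-4 are quoted above only at the header level. As a point of reference, the following is a minimal sketch of the generic tabular MaxEnt IRL gradient step (a backward soft value iteration to obtain a stochastic policy, a forward visitation pass, and a gradient equal to demonstration feature counts minus expected feature counts) that these algorithms build on. It is not the paper's RHIP procedure; the toy MDP, all sizes, and all names are hypothetical.

```python
# Minimal sketch of one MaxEnt IRL gradient step on a small tabular MDP.
# NOT the paper's RHIP algorithm; it only illustrates the generic backward
# pass -> stochastic policy -> forward pass -> gradient structure.
import numpy as np

n_states, n_actions, n_features, horizon = 5, 2, 3, 10
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probs (s, a, s')
phi = rng.normal(size=(n_states, n_features))                      # state features
theta = np.zeros(n_features)                                        # reward weights
demo = [0, 1, 2, 3]                                                  # one demonstrated state path

def maxent_gradient(theta):
    r = phi @ theta                                # per-state reward
    # Backward pass: finite-horizon soft value iteration toward the demo goal.
    v = np.full(n_states, -1e9)
    v[demo[-1]] = 0.0
    for _ in range(horizon):
        q = r[:, None] + np.einsum("sat,t->sa", P, v)
        v = np.logaddexp.reduce(q, axis=1)
        v[demo[-1]] = 0.0
    policy = np.exp(q - v[:, None])                # stochastic policy pi(a|s)
    # Forward pass: expected state visitations starting from the demo origin.
    d = np.zeros(n_states)
    d[demo[0]] = 1.0
    visits = np.zeros(n_states)
    for _ in range(horizon):
        visits += d
        d = np.einsum("s,sa,sat->t", d, policy, P)
    # Gradient of the demonstration log-likelihood w.r.t. theta:
    # demonstrated feature counts minus expected feature counts under the policy.
    return phi[demo].sum(axis=0) - visits @ phi

theta += 0.05 * maxent_gradient(theta)             # one SGD step (lr value from Table 4)
```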
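
For the Dataset Splits row, a minimal sketch of an even, date-based train/evaluation split over a two-week window. The column names, the midpoint cutoff rule, and the use of pandas are assumptions; the paper only states that routes are split evenly into training and evaluation sets based on date.

```python
# Hypothetical date-based split over a two-week window of demonstration routes.
import pandas as pd

routes = pd.DataFrame({
    "route_id": [1, 2, 3, 4],
    "trip_date": pd.to_datetime(["2023-05-01", "2023-05-04", "2023-05-09", "2023-05-12"]),
})

# Assumed cutoff: the midpoint of the two-week window.
cutoff = routes["trip_date"].min() + pd.Timedelta(days=7)
train_routes = routes[routes["trip_date"] < cutoff]
eval_routes = routes[routes["trip_date"] >= cutoff]
```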
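
For the Experiment Setup row, one way a replication attempt might organize the Table 4 values. Collecting them into a single config dict, and the grouping shown, are assumptions; the paper reports these values as candidate sets without tying them to specific runs.

```python
# Hedged organization of the Table 4 hyperparameters for a replication attempt.
# Lists hold the candidate values reported in the paper.
config = {
    "policy": {
        "softmax_temperature": [10, 20, 30],
        "horizon": [2, 10, 100],
    },
    "reward_model": {
        "hidden_layer_width": 18,
        "hidden_layer_depth": 2,
    },
    "optimizer": {
        "sgd_learning_rate": [0.05, 0.01],
        "adam_learning_rate": [1e-5, 1e-6],
    },
    "training": {
        "batch_size": 8,
        "epochs": [200, 400],
    },
}
```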