DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames of Experience
Authors: Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, Dhruv Batra
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience), over 6 months of GPU-time training, in under 3 days of wall-clock time with 64 GPUs. This massive-scale training not only sets the state of the art on the Habitat Autonomous Navigation Challenge 2019, but essentially solves the task: near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks, the analog of ImageNet pre-training + task-specific fine-tuning for embodied AI. |
| Researcher Affiliation | Collaboration | ¹Georgia Institute of Technology, ²Facebook AI Research, ³Oregon State University, ⁴Simon Fraser University |
| Pseudocode | Yes | See Fig. 9 for an example implementation, which adds 1) gradient synchronization via torch.nn.parallel.DistributedDataParallel, and 2) preemption of stragglers by tracking the number of workers that have finished the experience collection stage with a torch.distributed.TCPStore. (A minimal sketch of this pattern appears after the table.) |
| Open Source Code | Yes | Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available). Code: https://github.com/facebookresearch/habitat-api |
| Open Datasets | Yes | First, we utilize the training data released as part of the Habitat Challenge 2019, consisting of 72 scenes from the Gibson dataset (Xia et al., 2018). We then augment this with all 90 scenes in the Matterport3D dataset (Chang et al., 2017) to create a larger training set (note that Matterport3D meshes tend to be larger and of better quality). |
| Dataset Splits | Yes | Table 1: Performance (higher is better) of different architectures for agents with RGB-D and GPS+Compass sensors on the Habitat Challenge 2019 (Savva et al., 2019) validation and test-std splits (checkpoint selected on val). |
| Hardware Specification | Yes | We benchmark training our ResNet50 PointGoalNav agent with Depth on a cluster with Nvidia V100 GPUs and NCCL 2.4.7 with Infiniband interconnect. |
| Software Dependencies | Yes | We leverage PyTorch's (Paszke et al., 2017) DistributedDataParallel to synchronize gradients, and TCPStore, a simple distributed key-value store, to track how many workers have finished collecting experience. See Apx. E for a detailed description with code. See Fig. 9 for an example implementation, which adds 1) gradient synchronization via torch.nn.parallel.DistributedDataParallel, and 2) preemption of stragglers by tracking the number of workers that have finished the experience collection stage with a torch.distributed.TCPStore. |
| Experiment Setup | Yes | Training. We use PPO with Generalized Advantage Estimation (Schulman et al., 2015). We set the discount factor γ to 0.99 and the GAE parameter τ to 0.95. Each worker collects (up to) 128 frames of experience from 4 agents running in parallel (all in different environments) and then performs 2 epochs of PPO with 2 mini-batches per epoch. We use Adam (Kingma & Ba, 2014) with a learning rate of 2.5 × 10⁻⁴. Unlike popular implementations of PPO, we do not normalize advantages, as we find this leads to instabilities. We use DD-PPO to train with 64 workers on 64 GPUs. (A sketch of the GAE computation with these hyperparameters appears after the table.) |
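
As a companion to the Pseudocode and Software Dependencies rows, the following is a minimal sketch of the DD-PPO worker loop they describe: gradient synchronization via torch.nn.parallel.DistributedDataParallel plus straggler preemption via a torch.distributed.TCPStore counter. The helpers `collect_one_step` and `ppo_update`, the port number, and the preemption constants are hypothetical placeholders for illustration, not the released habitat-api implementation.

```python
import os

import torch
import torch.distributed as distrib

NUM_STEPS = 128        # rollout length per worker (see Experiment Setup row)
SYNC_FRAC = 0.6        # preempt once this fraction of workers has finished (illustrative)
MIN_STEPS_FRAC = 0.25  # never preempt before this fraction of the rollout (illustrative)


def train_worker(model, rollouts, num_updates):
    # One process per GPU; NCCL all-reduces gradients during backward().
    distrib.init_process_group(backend="nccl")
    rank, world_size = distrib.get_rank(), distrib.get_world_size()

    # Simple distributed key-value store used only to count finished workers.
    num_done_store = distrib.TCPStore(
        os.environ["MASTER_ADDR"], 12345, world_size, is_master=(rank == 0)
    )

    # DistributedDataParallel synchronizes gradients across workers.
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[torch.cuda.current_device()]
    )

    for _ in range(num_updates):
        num_done_store.set("num_done", "0")
        distrib.barrier()  # ensure the counter is reset before anyone collects

        # 1) Experience collection, cut short if this worker is a straggler.
        for step in range(NUM_STEPS):
            collect_one_step(model, rollouts)  # hypothetical env-stepping helper
            enough_done = (
                int(num_done_store.get("num_done")) >= SYNC_FRAC * world_size
            )
            if step >= MIN_STEPS_FRAC * NUM_STEPS and enough_done:
                break
        num_done_store.add("num_done", 1)

        # 2) Optimization: PPO epochs over whatever experience was collected;
        #    DDP averages the gradients so all workers stay in sync.
        ppo_update(model, rollouts)  # hypothetical PPO/GAE update helper
```

The point of the preemption counter is that the slowest simulators no longer gate every update: once enough workers have finished collecting, the rest stop early and move on to the synchronized gradient step, trading a small amount of experience for much better scaling.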
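
For the Experiment Setup row, this is a minimal sketch of Generalized Advantage Estimation with the listed hyperparameters (γ = 0.99, τ = 0.95). The tensor layout ([T, num_envs]) and the mask convention are assumptions for illustration; this is not the authors' implementation.

```python
import torch

GAMMA = 0.99          # discount factor from the Experiment Setup row
TAU = 0.95            # GAE parameter from the Experiment Setup row
LR = 2.5e-4           # Adam learning rate from the Experiment Setup row
PPO_EPOCHS = 2        # PPO epochs per update
NUM_MINI_BATCHES = 2  # mini-batches per epoch


def compute_gae_returns(rewards, values, masks, next_value):
    """Compute GAE(GAMMA, TAU) returns for one rollout.

    rewards, values, masks: tensors of shape [T, num_envs]
    next_value: value estimate for the state after the last step, shape [num_envs]
    masks[t] is 0.0 where the episode ended at step t, else 1.0 (assumed convention).
    """
    T = rewards.size(0)
    returns = torch.zeros_like(rewards)
    values = torch.cat([values, next_value.unsqueeze(0)], dim=0)  # [T + 1, num_envs]
    gae = torch.zeros_like(next_value)
    for t in reversed(range(T)):
        delta = rewards[t] + GAMMA * values[t + 1] * masks[t] - values[t]
        gae = delta + GAMMA * TAU * masks[t] * gae
        returns[t] = gae + values[t]
    return returns
```

Advantages would then be the returns minus the predicted values; per the paper they are used without normalization, which the authors found avoids instabilities. The constants above mirror the 2 epochs × 2 mini-batches schedule, and the optimizer would be built as torch.optim.Adam(model.parameters(), lr=LR).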