SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding
Authors: Paul-Edouard Sarlin, Eduard Trulls, Marc Pollefeys, Jan Hosang, Simon Lynen
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train and evaluate SNAP on a dataset with 50M Street View images from 5 continents, orders-of-magnitude larger and more diverse than comparable academic benchmarks. |
| Researcher Affiliation | Collaboration | Paul-Edouard Sarlin¹, Eduard Trulls² (trulls@google.com), Marc Pollefeys¹ (marc.pollefeys@ethz.ch), Jan Hosang² (hosang@google.com), Simon Lynen² (slynen@google.com); ¹ETH Zurich, ²Google Research |
| Pseudocode | No | The paper describes its methods in prose and uses diagrams (e.g., Figure 2, Figure 4) to illustrate architectures and data flow, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Work done during an internship at Google. Code available at github.com/google-research/snap |
| Open Datasets | No | Analytical use of Street View imagery was done with special permission from Google. |
| Dataset Splits | Yes | We train with 2.5M segments and 50M queries from 11 cities across the world: Barcelona, London, Paris (Europe), New York, San Francisco (North America), Rio de Janeiro (South America), Manila, Singapore, Taipei, Tokyo (Asia), and Sydney (Oceania), reserving some areas in each city for validation. |
| Hardware Specification | Yes | All models are trained on 16 A100 GPUs over 3-4 days with a total batch size of 32 (2 examples per GPU). |
| Software Dependencies | No | We develop our models with JAX [11] and Scenic [22], and format our dataset with TFDS [97]. All models are trained on 16 A100 GPUs over 3-4 days with a total batch size of 32 (2 examples per GPU). We use the ADAM [44] optimizer over 400k iterations for the small model and 200k iterations for the larger one. |
| Experiment Setup | Yes | In the ground-level encoder, Φ_I is a U-Net [71] with a BiT ResNet backbone [46], pre-trained as in [114], and an FPN decoder [51], initialized randomly. We consider two models with different backbones: a large R152x2 (353M parameters) and a small R50x1 (84M parameters). Φ_OV is a similarly-defined R50x1+FPN. In multi-view fusion (Sec. 2.2) we use D=32 depth planes and K=60 height planes {z_k} uniformly distributed within 12 m. Neural maps M and matching maps M have dimensions 128 and 32, respectively, and are defined over 64×16 m grids with 20 cm ground sample distance. Query BEVs have a maximum depth of 16 m. At training time, neural maps are built from one aerial tile and one SV segment, with each of the two randomly dropped, similarly to dropout [90]. We use a subset of N=20 views, some of them at a 60° angle, which we empirically found provides a good coverage/memory trade-off. See details in Appendix E. |
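
The Hardware Specification and Software Dependencies rows above fix most of the training configuration: JAX, the ADAM optimizer, 400k iterations for the small model and 200k for the large one, and a global batch size of 32 spread over 16 A100 GPUs. The sketch below restates that setup with Optax; the learning rate, function names, and loss function are illustrative assumptions, not values from the paper.

```python
import jax
import optax

# Batch layout quoted in the Hardware Specification row.
GLOBAL_BATCH_SIZE = 32
NUM_DEVICES = 16
PER_DEVICE_BATCH = GLOBAL_BATCH_SIZE // NUM_DEVICES  # 2 examples per GPU

# Iteration counts quoted in the Software Dependencies row.
NUM_STEPS_SMALL_MODEL = 400_000  # R50x1 backbone
NUM_STEPS_LARGE_MODEL = 200_000  # R152x2 backbone

# ADAM as cited; the learning rate is not given in the excerpt and is a placeholder.
optimizer = optax.adam(learning_rate=1e-4)


def train_step(params, opt_state, batch, loss_fn):
    """One gradient step; `loss_fn` stands in for the (unspecified) SNAP training loss."""
    grads = jax.grad(loss_fn)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state
```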
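
Similarly, the Experiment Setup row pins down the map and multi-view-fusion hyper-parameters. The following sketch only collects those quoted numbers into a single configuration object; the class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass

import jax.numpy as jnp


@dataclass(frozen=True)
class SnapMapConfig:
    """Hyper-parameters quoted in the Experiment Setup row (names are assumed)."""
    num_depth_planes: int = 32        # D = 32 depth planes for multi-view fusion
    num_height_planes: int = 60       # K = 60 height planes {z_k}
    height_range_m: float = 12.0      # planes uniformly distributed within 12 m
    neural_map_dim: int = 128         # feature dimension of the neural map M
    matching_map_dim: int = 32        # feature dimension of the matching map
    ground_sample_distance_m: float = 0.20  # 20 cm per map cell
    query_max_depth_m: float = 16.0   # maximum depth of query BEVs


cfg = SnapMapConfig()
# Uniformly spaced height planes, as described for multi-view fusion (Sec. 2.2).
z_planes = jnp.linspace(0.0, cfg.height_range_m, cfg.num_height_planes)
```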