SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding

Authors: Paul-Edouard Sarlin, Eduard Trulls, Marc Pollefeys, Jan Hosang, Simon Lynen

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train and evaluate SNAP on a dataset with 50M Street View images from 5 continents, orders-of-magnitude larger and more diverse than comparable academic benchmarks.
Researcher Affiliation | Collaboration | Paul-Edouard Sarlin [1], Eduard Trulls [2] (trulls@google.com), Marc Pollefeys [1] (marc.pollefeys@ethz.ch), Jan Hosang [2] (hosang@google.com), Simon Lynen [2] (slynen@google.com); [1] ETH Zurich, [2] Google Research
Pseudocode | No | The paper describes its methods in prose and uses diagrams (e.g., Figure 2, Figure 4) to illustrate architectures and data flow, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Work done during an internship at Google. Code available at github.com/google-research/snap
Open Datasets | No | Analytical use of Street View imagery was done with special permission from Google.
Dataset Splits | Yes | We train with 2.5M segments and 50M queries from 11 cities across the world: Barcelona, London, Paris (Europe), New York, San Francisco (North America), Rio de Janeiro (South America), Manila, Singapore, Taipei, Tokyo (Asia), and Sydney (Oceania), reserving some areas in each city for validation. (See the split sketch after the table.)
Hardware Specification | Yes | All models are trained on 16 A100 GPUs over 3-4 days with a total batch size of 32 (2 examples per GPU).
Software Dependencies | No | We develop our models with JAX [11] and Scenic [22], and format our dataset with TFDS [97]. All models are trained on 16 A100 GPUs over 3-4 days with a total batch size of 32 (2 examples per GPU). We use the ADAM [44] optimizer over 400k iterations for the small model and 200k iterations for the larger one. (See the training-setup sketch after the table.)
Experiment Setup | Yes | The ground-level encoder ΦI is a U-Net [71] with a BiT ResNet backbone [46], pre-trained as in [114], and an FPN decoder [51], initialized randomly. We consider two models with different backbones: a large R152x2 (353M parameters) and a small R50x1 (84M parameters). ΦOV is a similarly-defined R50x1+FPN. In multi-view fusion (Sec. 2.2) we use D=32 depth planes and K=60 height planes {z_k} uniformly distributed within 12 m. Neural maps M and the matching maps have dimensions 128 and 32, respectively, and are defined over 64×16 m grids with 20 cm ground sample distance. Query BEVs have a maximum depth of 16 m. At training time, neural maps are built from one aerial tile and one SV segment, with each of the two randomly dropped, similarly to dropout [90]. We use a subset of N=20 views, some of them at a 60° angle, which we empirically found provides a good coverage/memory trade-off. See details in Appendix E. (See the configuration sketch after the table.)
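For the Dataset Splits row, the following is a minimal sketch of how the 11 training cities and the reserved validation areas could be expressed in code. Only the city list comes from the paper; the TRAIN_CITIES structure, the split_of helper, and the bounding-box geofence format are hypothetical illustrations, not the SNAP codebase.

```python
# Illustrative only: city list as quoted in the Dataset Splits row; the split
# mechanism ("reserving some areas in each city for validation") is paraphrased
# here with a hypothetical bounding-box check.
TRAIN_CITIES = {
    "Europe": ["Barcelona", "London", "Paris"],
    "North America": ["New York", "San Francisco"],
    "South America": ["Rio de Janeiro"],
    "Asia": ["Manila", "Singapore", "Taipei", "Tokyo"],
    "Oceania": ["Sydney"],
}

def split_of(city, lat, lon, val_areas):
    """Returns 'val' if (lat, lon) falls in a reserved area of the city, else 'train'."""
    for lat_min, lon_min, lat_max, lon_max in val_areas.get(city, []):
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return "val"
    return "train"
```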
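The Hardware Specification and Software Dependencies rows together pin down the training recipe: JAX, the ADAM optimizer, 16 A100 GPUs with 2 examples each (global batch size 32), and 400k/200k steps for the small/large models. Below is a minimal JAX/Optax sketch of that setup; the learning rate, the dummy parameters, and the use of Optax specifically are assumptions, since the paper only names the optimizer.

```python
import jax
import jax.numpy as jnp
import optax

# Iteration budgets stated in the paper: 400k steps for the small model,
# 200k for the large one.
NUM_STEPS = {"small_R50x1": 400_000, "large_R152x2": 200_000}

# 2 examples per GPU across 16 A100s gives the quoted global batch size of 32.
PER_DEVICE_BATCH = 2
global_batch = PER_DEVICE_BATCH * jax.device_count()

# ADAM as in the paper; the learning rate is a placeholder, not quoted.
optimizer = optax.adam(learning_rate=1e-3)
dummy_params = {"w": jnp.zeros((128,))}   # stand-in for the real model parameters
opt_state = optimizer.init(dummy_params)
```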
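The Experiment Setup row enumerates the architecture and map hyperparameters. One hedged way to read them as a single configuration is the dataclass below: the dataclass, its field names, and the derived quantities are illustrative; only the numeric values come from the quoted text.

```python
from dataclasses import dataclass

@dataclass
class SnapSetup:
    """Hyperparameters from the Experiment Setup row; names are illustrative."""
    ground_backbone: str = "R152x2"        # large model, 353M params ("R50x1" for the 84M small one)
    overhead_backbone: str = "R50x1"       # ΦOV, a similarly-defined R50x1+FPN
    depth_planes: int = 32                 # D, used in multi-view fusion
    height_planes: int = 60                # K height planes {z_k} over a 12 m range
    height_range_m: float = 12.0
    neural_map_dim: int = 128
    matching_map_dim: int = 32
    grid_extent_m: tuple = (64.0, 16.0)    # map tile footprint
    ground_sample_distance_m: float = 0.2  # 20 cm
    query_bev_max_depth_m: float = 16.0
    num_views: int = 20                    # N views per segment, some at a 60-degree angle
    modality_dropout: bool = True          # randomly drop the aerial tile or the SV segment

cfg = SnapSetup()
# Implied resolutions: 64 m / 0.2 m = 320 cells by 16 m / 0.2 m = 80 cells per map,
# and 12 m / 60 = 0.2 m spacing between height planes.
grid_cells = tuple(int(round(e / cfg.ground_sample_distance_m)) for e in cfg.grid_extent_m)
height_spacing_m = cfg.height_range_m / cfg.height_planes
```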