$SE(3)$ Equivariant Ray Embeddings for Implicit Multi-View Depth Estimation
Authors: Yinshuang Xu, Dian Chen, Katherine Liu, Sergey Zakharov, Rareș Ambruș, Kostas Daniilidis, Vitor Guizilini
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experimental Results: We use ScanNet [12] and DeMoN [49] to validate our model on the task of stereo depth estimation. We compare our equivariant model with other state-of-the-art methods on stereo depth estimation, and report quantitative results in Table 1. We performed an ablation study on the geometric positional encodings, spherical harmonics encoding, equivariant attention, and the decoder architecture, and report the quantitative results in Table 3. |
| Researcher Affiliation | Collaboration | Yinshuang Xu, University of Pennsylvania, xuyin@seas.upenn.edu; Dian Chen, Toyota Research Institute, dian.chen@tri.global; Katherine Liu, Toyota Research Institute, katherine.liu@tri.global; Sergey Zakharov, Toyota Research Institute, sergey.zakharov@tri.global; Rares Ambrus, Toyota Research Institute, rares.ambrus@tri.global; Kostas Daniilidis, University of Pennsylvania, kostas@cis.upenn.edu; Vitor Guizilini, Toyota Research Institute, vitor.guizilini@tri.global |
| Pseudocode | No | The paper describes methods in text and figures (e.g., Figure 2 for architecture), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We used PyTorch to implement our Equivariant Perceiver IO and will open-source our code and pre-trained weights upon acceptance. |
| Open Datasets | Yes | We use ScanNet [12] and DeMoN [49] to validate our model on the task of stereo depth estimation. |
| Dataset Splits | Yes | For ScanNet, we use the same setting as [31], which downsamples scenes by a factor of 20 and splits them to obtain 94,212 training and 7,517 test pairs. The DeMoN dataset includes the SUN3D, RGBD-SLAM and Scenes11 datasets, where SUN3D and RGBD-SLAM are real-world datasets and Scenes11 is a synthetic dataset. There are a total of 166,285 training image pairs from 50,420 scenes, and we use the same test split as [31] (80 pairs in SUN3D, 80 pairs in RGBD-SLAM and 168 pairs in Scenes11). (A hedged sketch of one reading of this subsampling protocol appears after the table.) |
| Hardware Specification | Yes | Training and evaluation were conducted using distributed training (DDP) on 8 A100 GPUs with 80 GB each. (A minimal DDP sketch consistent with this setup appears after the table.) |
| Software Dependencies | No | We used PyTorch to implement our Equivariant Perceiver IO and will open-source our code and pre-trained weights upon acceptance. |
| Experiment Setup | Yes | We used a batch size of 192, the AdamW optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$, weight decay of $10^{-4}$, and an initial learning rate of $2 \times 10^{-4}$. For ScanNet, the training duration was 200 epochs, with the learning rate halved every 80 epochs; for the DeMoN datasets, the training duration was likewise 200 epochs, with the learning rate halved every 80 epochs. We used the same losses as DeFiNe, i.e., the L1-log loss, with a weight of 1.0 for real views and 0.2 for virtual views. (A hedged sketch of this optimization setup appears after the table.) |
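
The ScanNet split quoted above says scenes are "downsampled by a factor of 20" following [31], but the paper does not give code for this step. The sketch below illustrates one possible reading, where every 20th frame of a scene is kept and consecutive kept frames form stereo pairs; the function name, the stride interpretation, and the pairing rule are all assumptions for illustration, not the authors' released protocol.

```python
# Hypothetical sketch of stride-based pair subsampling, assuming
# "downsample scenes by a factor of 20" means keeping every 20th frame.
# Names and pairing logic are illustrative, not from the paper.
from typing import List, Tuple

def make_stereo_pairs(frame_ids: List[int], stride: int = 20) -> List[Tuple[int, int]]:
    """Keep every `stride`-th frame and pair consecutive kept frames."""
    kept = frame_ids[::stride]
    return list(zip(kept[:-1], kept[1:]))

# Example: a 200-frame scene yields 9 stereo pairs under a stride of 20.
pairs = make_stereo_pairs(list(range(200)))
print(len(pairs))  # 9
```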
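The hardware row reports distributed training (DDP) on 8 A100 GPUs. A minimal PyTorch DDP wrapper consistent with that description might look as follows; the stand-in model and the `torchrun` launch line are assumptions, since the authors' code is not yet released.

```python
# Minimal PyTorch DDP sketch matching the reported setup (8 GPUs via DDP).
# Launch with e.g. `torchrun --nproc_per_node=8 train.py`; torchrun sets
# the rank/world-size environment variables that init_process_group reads.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp() -> int:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

local_rank = setup_ddp()
# Stand-in module; the paper's Equivariant Perceiver IO is not public.
model = torch.nn.Linear(8, 8).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
```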
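The experiment-setup row pins down the optimizer, schedule, and loss weights. The sketch below wires those reported hyperparameters into standard PyTorch calls; the stand-in model, the loss-function wiring, and the exact L1-log form (absolute difference of log depths, following DeFiNe) are assumptions, while the numeric values come from the quoted text.

```python
# Sketch of the reported optimization setup: AdamW (betas 0.9/0.999,
# weight decay 1e-4), initial lr 2e-4 halved every 80 epochs, and an
# L1-log depth loss weighted 1.0 for real views and 0.2 for virtual views.
import torch

model = torch.nn.Linear(8, 1)  # stand-in for the paper's network
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.9, 0.999), weight_decay=1e-4
)
# Halve the learning rate every 80 epochs (step once per epoch).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.5)

def l1_log_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """L1 distance between log depths (assumed DeFiNe-style form)."""
    return (torch.log(pred) - torch.log(gt)).abs().mean()

def total_loss(pred_real, gt_real, pred_virtual, gt_virtual):
    # Weights from the paper: 1.0 for real views, 0.2 for virtual views.
    return 1.0 * l1_log_loss(pred_real, gt_real) + 0.2 * l1_log_loss(
        pred_virtual, gt_virtual
    )
```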