Equivariant Single View Pose Prediction Via Induced and Restriction Representations
Authors: Owen Howell, David Klee, Ondrej Biza, Linfeng Zhao, Robin Walters
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test our architecture on three pose prediction tasks and achieve SOTA results on both the PASCAL3D+ and SYMSOL pose estimation tasks. |
| Researcher Affiliation | Academia | Owen Howell¹, David Klee², Ondrej Biza², Linfeng Zhao², and Robin Walters². ¹Department of Electrical and Computer Engineering, Northeastern University; ²Khoury College of Computer Sciences, Northeastern University |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. It describes the methods in text and mathematical formulations. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing code or a link to a code repository. |
| Open Datasets | Yes | We evaluate the performance of our method on three single-object pose estimation datasets. These datasets require making predictions in SO(3) from single 2D images. SYMSOL [14] consists of a set of images of marked and unmarked geometric solids, taken from different vantage points. Training data is annotated with viewing direction. PASCAL3D+ [13] is a popular benchmark for object pose estimation composed of real images of objects from twelve categories. |
| Dataset Splits | Yes | To be consistent with the baselines, we augment the training data with synthetic renderings [45] and evaluate performance on the PASCALVOC_val set. |
| Hardware Specification | Yes | Numerical experiments were implemented on NVIDIA P-100 GPUs. |
| Software Dependencies | No | The paper mentions software packages such as the e2nn [38] package, the e3nn [43] package, and PyTorch [46], but does not specify their version numbers, which are required for reproducibility. |
| Experiment Setup | Yes | For the results presented in 6, we use a ResNet encoder with weights pre-trained on ImageNet. With 224x224 images as input, this generates a 7x7 feature map with 2048 channels. The filters in the induction layer are instantiated using the e2nn [38] package. The maximum frequency is set to 6. The output of the induction layer is a 64-channel S2 signal with fibers transforming in the trivial representation of SO(3). After the induction layer, a spherical convolution operation is performed using a filter that is parameterized in the Fourier domain, which generates an 8-channel signal over SO(3). A spherical non-linearity is applied by mapping the signal to the spatial domain, applying a ReLU, then mapping back to the Fourier domain. One final spherical convolution with a locally supported filter is performed to generate a one-dimensional signal on SO(3). The output signal is queried using an SO(3) HEALPix grid (recursion level 3 during training, 5 during evaluation) and then normalized using a softmax following [14]. S2 and SO(3) convolutions were performed using the e3nn [43] package. The network was initialized and trained using PyTorch [46]. In order to create a fair comparison to existing baselines, batch size (64), number of epochs (40), optimizer (SGD) and learning rate schedule (StepLR) were chosen to be the same as that of [12]. |
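The quoted setup can be summarized in a short sketch. The following is a minimal, hedged PyTorch sketch, not the authors' implementation: the ImageNet-pretrained ResNet backbone (assumed ResNet-50 here), batch size, epoch count, SGD optimizer, and StepLR schedule follow the quoted text, while the induction layer and the S2/SO(3) convolutions (built with the e2nn and e3nn packages in the paper) are replaced by stand-in modules, and the learning rate and step size are assumed illustrative values.

```python
# Hedged sketch of the experiment setup described above.
# Stand-in modules mark where the paper's e2nn induction layer and
# e3nn spherical convolutions would go; they are NOT the actual layers.
import torch
import torch.nn as nn
import torchvision


class PosePredictorSketch(nn.Module):
    def __init__(self, n_sphere_channels=64, n_so3_channels=8):
        super().__init__()
        # ResNet-50 encoder pretrained on ImageNet (assumed depth);
        # a 224x224 input yields a 7x7 feature map with 2048 channels.
        backbone = torchvision.models.resnet50(weights="DEFAULT")
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # Stand-in for the induction layer (e2nn filters, max frequency 6)
        # lifting the planar features to a 64-channel S2 signal.
        self.induction = nn.Conv2d(2048, n_sphere_channels, kernel_size=1)
        # Stand-ins for the S2 -> SO(3) convolution, the spatial ReLU, and
        # the final locally supported SO(3) convolution producing a
        # one-dimensional signal on SO(3) (e3nn Fourier-domain filters in the paper).
        self.so3_head = nn.Sequential(
            nn.Conv2d(n_sphere_channels, n_so3_channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(n_so3_channels, 1, kernel_size=1),
        )

    def forward(self, x):
        feats = self.encoder(x)          # (B, 2048, 7, 7)
        sphere = self.induction(feats)   # stand-in for the S2 signal
        logits = self.so3_head(sphere)   # stand-in for the SO(3) signal
        logits = logits.flatten(1)       # stand-in for the SO(3) HEALPix grid query
        # Softmax normalization over the grid of candidate rotations, following [14].
        return torch.softmax(logits, dim=-1)


model = PosePredictorSketch()
# Training configuration matching the quoted hyperparameters: SGD with a StepLR
# schedule, batch size 64, 40 epochs; lr, momentum, step_size, gamma are assumed.
BATCH_SIZE, EPOCHS = 64, 40
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```

The softmax over a flattened feature map above only mimics the shape of the real output; in the paper the final signal is evaluated on an SO(3) HEALPix grid (recursion level 3 during training, 5 during evaluation) before normalization.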