Equivariant Single View Pose Prediction Via Induced and Restriction Representations

Authors: Owen Howell, David Klee, Ondrej Biza, Linfeng Zhao, Robin Walters

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test our architecture on three pose prediction tasks and achieve SOTA results on both the PASCAL3D+ and SYMSOL pose estimation tasks.
Researcher Affiliation | Academia | Owen Howell (1), David Klee (2), Ondrej Biza (2), Linfeng Zhao (2), and Robin Walters (2); (1) Department of Electrical and Computer Engineering, Northeastern University; (2) Khoury College of Computer Sciences, Northeastern University
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. It describes the methods in text and mathematical formulations.
Open Source Code | No | The paper does not provide any explicit statement about releasing code or a link to a code repository.
Open Datasets | Yes | We evaluate the performance of our method on three single-object pose estimation datasets. These datasets require making predictions in SO(3) from single 2D images. SYMSOL [14] consists of a set of images of marked and unmarked geometric solids, taken from different vantage points. Training data is annotated with viewing direction. PASCAL3D+ [13] is a popular benchmark for object pose estimation composed of real images of objects from twelve categories.
Dataset Splits | Yes | To be consistent with the baselines, we augment the training data with synthetic renderings [45] and evaluate performance on the PASCALVOC_val set.
Hardware Specification | Yes | Numerical experiments were implemented on NVIDIA P100 GPUs.
Software Dependencies | No | The paper mentions software packages such as the e2cnn [38] package, the e3nn [43] package, and PyTorch [46], but does not specify their version numbers, which is required for reproducibility.
Experiment Setup | Yes | For the results presented in 6, we use a ResNet encoder with weights pre-trained on ImageNet. With 224x224 images as input, this generates a 7x7 feature map with 2048 channels. The filters in the induction layer are instantiated using the e2cnn [38] package. The maximum frequency is set to 6. The output of the induction layer is a 64-channeled S2 signal with fibers transforming in the trivial representation of SO(3). After the induction layer, a spherical convolution operation is performed using a filter that is parameterized in the Fourier domain, which generates an 8-channel signal over SO(3). A spherical non-linearity is applied by mapping the signal to the spatial domain, applying a ReLU, then mapping back to the Fourier domain. One final spherical convolution with a locally supported filter is performed to generate a one-dimensional signal on SO(3). The output signal is queried using an SO(3) HEALPix grid (recursion level 3 during training, 5 during evaluation) and then normalized using a softmax following [14]. S2 and SO(3) convolutions were performed using the e3nn [43] package. The network was initialized and trained using PyTorch [46]. In order to create a fair comparison to existing baselines, batch size (64), number of epochs (40), optimizer (SGD) and learning rate schedule (StepLR) were chosen to be the same as that of [12].
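The grid sizes implied by the HEALPix recursion levels in the setup above, and the softmax normalization over the queried grid, can be sketched in plain Python. The 72 * 8**level rotation count is an assumption here, taken from the SO(3) HEALPix convention of the implicit-pdf work cited as [14] (12 * N_side**2 spatial pixels times 6 * N_side tilt angles, with N_side = 2**level); the report itself does not state the formula.

```python
import math

def so3_healpix_size(recursion_level: int) -> int:
    # Assumed convention (from [14], not stated in this report):
    # an SO(3) HEALPix grid at recursion level L has 72 * 8**L rotations,
    # i.e. 12 * (2**L)**2 sphere pixels times 6 * 2**L in-plane tilts.
    return 72 * 8 ** recursion_level

def softmax(logits):
    # Normalize the signal queried at the grid points into a probability
    # distribution over rotations, as described in the setup (following [14]).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Recursion level 3 is used during training, level 5 during evaluation.
print(so3_healpix_size(3))  # 36864 grid rotations
print(so3_healpix_size(5))  # 2359296 grid rotations
```

Under this convention, evaluation queries a grid 64x finer than training, which is why the paper can train on a coarse grid but report fine-grained pose distributions.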