Unsupervised Learning of Shape and Pose with Differentiable Point Clouds

Authors: Eldar Insafutdinov, Alexey Dosovitskiy

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed approach on the task of estimating the shape and the camera pose from a single image of an object. The method successfully learns to predict both the shape and the pose, with only a minor performance drop relative to a model trained with ground truth camera poses. The point-cloud-based formulation allows for effective learning of high-fidelity shape models when provided with images of sufficiently high resolution as supervision. We demonstrate learning point clouds from silhouettes and augmenting those with color if color images are available during training. Finally, we show how the point cloud representation allows to automatically discover semantic correspondences between objects.
Researcher Affiliation | Collaboration | Eldar Insafutdinov, Max Planck Institute for Informatics (eldar@mpi-inf.mpg.de); Alexey Dosovitskiy, Intel Labs (adosovitskiy@gmail.com)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The project website with code can be found at https://eldar.github.io/PointClouds/.
Open Datasets | Yes | Datasets. We conduct the experiments on 3D models from the ShapeNet [3] dataset. We focus on 3 categories typically used in related work: chairs, cars, and airplanes. We follow the train/test protocol and the data generation procedure of Tulsiani et al. [20]: split the models into training, validation and test sets and render 5 random views of each model with random light source positions and random camera azimuth and elevation, sampled uniformly from [0°, 360°) and [-20°, 40°] respectively.
Dataset Splits | Yes | Datasets. We conduct the experiments on 3D models from the ShapeNet [3] dataset. We focus on 3 categories typically used in related work: chairs, cars, and airplanes. We follow the train/test protocol and the data generation procedure of Tulsiani et al. [20]: split the models into training, validation and test sets and render 5 random views of each model with random light source positions and random camera azimuth and elevation, sampled uniformly from [0°, 360°) and [-20°, 40°] respectively.
Hardware Specification | No | The paper mentions GPU memory (12 GB) but does not provide specific details on the GPU model, CPU, or other hardware used for the experiments. It only states that a configuration "does not fit into 12Gb of GPU memory with our batch size".
Software Dependencies | No | The paper mentions using TensorFlow [1] and the Adam optimizer [9] but does not provide version numbers for any software dependencies.
Experiment Setup | Yes | Training details. We trained the networks using the Adam optimizer [9], for 600,000 mini-batch iterations. We used mini-batches of 16 samples (4 views of 4 objects). We used a fixed learning rate of 0.0001 and the standard momentum parameters. We used the fast projection in most experiments, unless mentioned otherwise. We varied both the number of points in the point cloud and the resolution of the volume used in the projection operation depending on the resolution of the ground truth projections used for supervision. We used a volume with the same side as the training samples (e.g., a 64³ volume for 64² projections), and we used 2000 points for 32² projections, 8000 points for 64² projections, and 16,000 points for 128² projections.
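To make the Research Type row more concrete, here is a minimal sketch of the kind of point-cloud-to-silhouette projection the paper builds on: points are scattered into a voxel grid with trilinear weights, and the grid is collapsed along the depth axis into a soft 2D silhouette. This is an illustration in NumPy, not the authors' differentiable TensorFlow implementation; the function name, the exponential occupancy squashing, and the ray aggregation are assumptions made for this sketch.

```python
import numpy as np

def project_to_silhouette(points, grid_size=64):
    """Scatter an (N, 3) point cloud with coordinates in [0, 1)^3 into a voxel
    grid using trilinear weights, then collapse the depth axis into a soft 2D
    silhouette. Illustrative only; the paper's projection is implemented as a
    differentiable operation inside the training graph."""
    volume = np.zeros((grid_size,) * 3, dtype=np.float32)
    coords = points * (grid_size - 1)            # continuous voxel coordinates
    base = np.floor(coords).astype(int)          # lower corner of enclosing cell
    frac = coords - base                         # trilinear interpolation weights
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((frac[:, 0] if dx else 1 - frac[:, 0])
                     * (frac[:, 1] if dy else 1 - frac[:, 1])
                     * (frac[:, 2] if dz else 1 - frac[:, 2]))
                idx = np.clip(base + np.array([dx, dy, dz]), 0, grid_size - 1)
                np.add.at(volume, (idx[:, 0], idx[:, 1], idx[:, 2]), w)
    occupancy = 1.0 - np.exp(-volume)            # soft per-voxel occupancy
    # A pixel is inside the silhouette if any voxel along its ray is occupied.
    silhouette = 1.0 - np.prod(1.0 - occupancy, axis=2)
    return silhouette

# Usage: 8000 random points produce a 64x64 soft silhouette.
cloud = np.random.rand(8000, 3)
mask = project_to_silhouette(cloud, grid_size=64)
print(mask.shape)  # (64, 64)
```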
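The dataset rows quote the rendering protocol of Tulsiani et al. [20]. A hypothetical sketch of that view sampling, assuming the elevation range is [-20°, 40°] as quoted above, could look like the following; the function name and structure are illustrative.

```python
import random

def sample_view():
    """Sample one random camera pose as described in the dataset protocol:
    azimuth uniform in [0, 360) degrees, elevation uniform in [-20, 40] degrees."""
    azimuth = random.uniform(0.0, 360.0)
    elevation = random.uniform(-20.0, 40.0)
    return azimuth, elevation

# Five random views are rendered per ShapeNet model, each with a random light source.
views = [sample_view() for _ in range(5)]
print(views)
```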
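Finally, the training details quoted in the Experiment Setup row can be summarized as a configuration sketch. The dictionary below is assembled from the reported numbers; the key names and structure are assumptions, not the authors' code.

```python
# Hypothetical training configuration assembled from the reported details.
TRAIN_CONFIG = {
    "optimizer": "Adam",            # standard momentum parameters
    "learning_rate": 1e-4,          # fixed throughout training
    "iterations": 600_000,          # mini-batch iterations
    "batch_size": 16,               # 4 random views of 4 objects
    "projection": "fast",           # fast projection used unless noted otherwise
    # Point-cloud size and projection volume scale with the supervision resolution.
    "resolution_presets": {
        32:  {"num_points": 2_000,  "volume_side": 32},
        64:  {"num_points": 8_000,  "volume_side": 64},
        128: {"num_points": 16_000, "volume_side": 128},
    },
}
```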