Disentangling 3D Prototypical Networks for Few-Shot Concept Learning

Authors: Mihir Prabhudesai, Shamit Lal, Darshan Patil, Hsiao-Yu Tung, Adam W Harley, Katerina Fragkiadaki

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test D3DP-Nets in few-shot concept learning, visual question answering (VQA) and scene generation. We train concept classifiers for object shapes, object colors/materials, and spatial relationships on our inferred disentangled feature spaces, and show they outperform current state-of-the-art (Mao et al., 2019; Hu et al., 2016), which use 2D representations. We show that a VQA modular network that incorporates our concept classifiers shows improved generalization over the state-of-the-art (Mao et al., 2019) with dramatically fewer examples. Last, we empirically show that D3DP-Nets generalize their view predictions to scenes with novel number, category and styles of objects, and compare against state-of-the-art view predictive architectures of Eslami et al. (2018). ... Table 1: Five & one shot classification accuracy for shape and style concepts in CLEVR (Johnson et al., 2017), Real Veggie, and Replica datasets.
Researcher Affiliation | Academia | Mihir Prabhudesai 1, Shamit Lal 1, Darshan Patil 2, Hsiao-Yu Tung 1, Adam W Harley 1, Katerina Fragkiadaki 1; 1 Carnegie Mellon University, 2 Mila, University of Montreal; {mprabhud,shamitl}@cs.cmu.edu, darshan.patil@mila.quebec, {htung, aharley, katef}@cs.cmu.edu
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper or its supplementary material.
Open Source Code | No | Project page: https://mihirp1998.github.io/project_pages/d3dp/. While a project page is provided, it is not explicitly stated to contain the source code, nor is it a direct link to a code repository as per the strict definition.
Open Datasets | Yes | We evaluate D3DP-Nets in its ability to classify shape and style concepts from few annotated examples on three datasets: i) CLEVR dataset (Johnson et al., 2017); ii) Real Veggie dataset: it is a real-world scene dataset we collected that contains 800 RGB-D scenes of vegetables placed on a table surface; iii) Replica dataset (Straub et al., 2019): it consists of 18 high quality reconstructions of indoor scenes. We use AI Habitat simulator (Manolis Savva* et al., 2019) to render multiview RGB-D data for it. ... CARLA Dataset. We use CARLA dataset to show detector improvement results in Appendix D. We use the 26 vehicle classes available in Carla 0.9.7 to prepare our dataset.
Dataset Splits | Yes | The first dataset is a support dataset containing 1200 scenes in the training split and 400 scenes in the validation split. For each scene, 12 different RGB-D views are generated (4 different azimuths, 3 different elevations).
Hardware Specification | Yes | Our model converges in 10-12 hrs of training and requires 0.8 seconds for an inference step on a single RTX 2080.
Software Dependencies | No | No specific software dependencies with version numbers are provided. The paper mentions 'Adam optimizer (Kingma & Ba, 2014)' but no version.
Experiment Setup | Yes | The input RGB and depth images are resized to a resolution of 320 × 480 for all the datasets. While training using view prediction, we randomly sample 2 views from each multi-view scene. ... We train a 3D object detector that takes as input the output of the scene feature map M and predicts 3D axis-aligned bounding boxes, similar to Harley et al. (2020). ... Every VQA model is trained for 60 epochs with early stopping. We use the Adam optimizer (Kingma & Ba, 2014) initialized with a learning rate of .001.
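
The Research Type row above quotes the paper's few-shot concept-classification experiments, where classifiers for shape and style concepts are learned on disentangled feature spaces from only a handful of labeled examples. As a hedged illustration only, and not the authors' released code, the sketch below shows the standard prototypical-network-style classifier such few-shot experiments typically build on: each concept's prototype is the mean embedding of its few labeled support examples, and a query is assigned to the nearest prototype. All names here (build_prototypes, support_emb, the 64-d embedding size) are illustrative assumptions.

# Hedged sketch of prototype-based few-shot concept classification.
# Assumes concept embeddings (e.g., pooled shape or style codes from the
# disentangled 3D feature maps) have already been extracted; names are
# illustrative, not the authors' API.
import torch

def build_prototypes(support_emb: torch.Tensor, support_labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    # support_emb:    (N, D) embeddings of the few labeled support examples
    # support_labels: (N,)   integer class ids in [0, num_classes)
    # returns:        (num_classes, D) prototype matrix (per-class mean embedding)
    protos = torch.zeros(num_classes, support_emb.size(1))
    for c in range(num_classes):
        protos[c] = support_emb[support_labels == c].mean(dim=0)
    return protos

def classify(query_emb: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    # Assign each query embedding to the nearest prototype (Euclidean distance).
    dists = torch.cdist(query_emb, prototypes)            # (Q, num_classes)
    return dists.argmin(dim=1)                            # (Q,) predicted class ids

# Example: a 5-shot episode with 3 shape concepts and 64-d embeddings.
support = torch.randn(15, 64)                  # 3 classes x 5 shots
labels = torch.arange(3).repeat_interleave(5)  # [0,0,0,0,0,1,...,2]
queries = torch.randn(10, 64)
preds = classify(queries, build_prototypes(support, labels, num_classes=3))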
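
The Dataset Splits and Experiment Setup rows quote the relevant training details: a 1200/400 scene split with 12 RGB-D views per scene, inputs resized to 320 × 480, two views randomly sampled per multi-view scene during view-predictive training, and the Adam optimizer at learning rate 0.001 for 60 epochs with early stopping. The minimal sketch below shows how such a loop could be wired up under those quoted settings; the model, the loss function, the scene containers, and the early-stopping patience of 5 are assumptions for illustration, since the paper does not specify them.

# Hedged sketch of the quoted training configuration, not the authors' pipeline:
# two views sampled per scene (12 RGB-D views each), 320x480 inputs,
# Adam at lr 0.001, up to 60 epochs with early stopping on the validation split.
import random
import torch
import torch.nn.functional as F

def resize_rgbd(rgb: torch.Tensor, depth: torch.Tensor, size=(320, 480)):
    # Resize an RGB (3,H,W) and depth (1,H,W) view to the quoted resolution.
    rgb = F.interpolate(rgb[None], size=size, mode="bilinear", align_corners=False)[0]
    depth = F.interpolate(depth[None], size=size, mode="nearest")[0]
    return rgb, depth

def train(model, loss_fn, train_scenes, val_scenes, epochs=60, patience=5):
    # train_scenes / val_scenes: lists of scenes, each a list of 12 RGB-D views
    # (roughly 1200 and 400 scenes respectively, per the quoted split).
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # lr 0.001 as quoted
    best_val, bad_epochs = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for views in train_scenes:
            inp, tgt = random.sample(views, k=2)           # 2 random views per scene
            loss = loss_fn(model, inp, tgt)
            opt.zero_grad(); loss.backward(); opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model, *random.sample(v, k=2)).item()
                      for v in val_scenes) / max(len(val_scenes), 1)
        if val < best_val - 1e-6:
            best_val, bad_epochs = val, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                     # early stopping (patience assumed)
                break
    return model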