Disentangling 3D Prototypical Networks for Few-Shot Concept Learning
Authors: Mihir Prabhudesai, Shamit Lal, Darshan Patil, Hsiao-Yu Tung, Adam W Harley, Katerina Fragkiadaki
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test D3DP-Nets in few-shot concept learning, visual question answering (VQA) and scene generation. We train concept classifiers for object shapes, object colors/materials, and spatial relationships on our inferred disentangled feature spaces, and show they outperform current state-of-the-art (Mao et al., 2019; Hu et al., 2016), which use 2D representations. We show that a VQA modular network that incorporates our concept classifiers shows improved generalization over the state-of-the-art (Mao et al., 2019) with dramatically fewer examples. Last, we empirically show that D3DP-Nets generalize their view predictions to scenes with novel number, category and styles of objects, and compare against state-of-the-art view predictive architectures of Eslami et al. (2018). Table 1: Five & one shot classification accuracy for shape and style concepts in CLEVR (Johnson et al., 2017), Real Veggie, and Replica datasets. |
| Researcher Affiliation | Academia | Mihir Prabhudesai (1), Shamit Lal (1), Darshan Patil (2), Hsiao-Yu Tung (1), Adam W Harley (1), Katerina Fragkiadaki (1); (1) Carnegie Mellon University, (2) Mila, University of Montreal. {mprabhud,shamitl}@cs.cmu.edu, darshan.patil@mila.quebec, {htung, aharley, katef}@cs.cmu.edu |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper or its supplementary material. |
| Open Source Code | No | Project page: https://mihirp1998.github.io/project_pages/d3dp/. A project page is provided, but it is not a direct link to a code repository and is not stated to host the source code, so it does not meet the strict definition. |
| Open Datasets | Yes | We evaluate D3DP-Nets in its ability to classify shape and style concepts from few annotated examples on three datasets: i) CLEVR dataset (Johnson et al., 2017); ii) Real Veggie dataset: it is a real-world scene dataset we collected that contains 800 RGB-D scenes of vegetables placed on a table surface; iii) Replica dataset (Straub et al., 2019): it consists of 18 high quality reconstructions of indoor scenes. We use AI Habitat simulator (Manolis Savva* et al., 2019) to render multiview RGB-D data for it. CARLA Dataset. We use CARLA dataset to show detector improvement results in Appendix D. We use the 26 vehicle classes available in Carla 0.9.7 to prepare our dataset. |
| Dataset Splits | Yes | The first dataset is a support dataset containing 1200 scenes in the training split and 400 scenes in the validation split. For each scene, 12 different RGB-D views are generated (4 different azimuths, 3 different elevations); a sketch of this view grid appears after the table. |
| Hardware Specification | Yes | Our model converges in 10-12hrs of training and requires 0.8 seconds for an inference step on a single RTX 2080. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided. The paper mentions 'Adam optimizer (Kingma & Ba, 2014)' but no version. |
| Experiment Setup | Yes | The input RGB and depth images are resized to a resolution of 320 × 480 for all the datasets. While training using view prediction, we randomly sample 2 views from each multi-view scene. ... We train a 3D object detector that takes as input the output of the scene feature map M and predicts 3D axis-aligned bounding boxes, similar to Harley et al. (2020). ... Every VQA model is trained for 60 epochs with early stopping. We use the Adam optimizer (Kingma & Ba, 2014) initialized with a learning rate of 0.001. A sketch of this training setup appears after the table. |
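
The Dataset Splits row states that each scene is rendered from 12 RGB-D views, laid out as 4 azimuths by 3 elevations. The minimal sketch below enumerates such a pose grid; the paper does not report the actual angle values, so `AZIMUTHS_DEG` and `ELEVATIONS_DEG` are illustrative placeholders, not the authors' rendering parameters.

```python
from itertools import product

# Hypothetical pose grid: the paper reports 12 views per scene
# (4 azimuths x 3 elevations) but not the exact angles, so these
# values are placeholders for illustration only.
AZIMUTHS_DEG = [0, 90, 180, 270]   # assumed 4 evenly spaced azimuths
ELEVATIONS_DEG = [20, 40, 60]      # assumed 3 elevations

def camera_poses():
    """Enumerate the 12 (azimuth, elevation) pairs used to render one scene."""
    return list(product(AZIMUTHS_DEG, ELEVATIONS_DEG))

assert len(camera_poses()) == 12   # 4 azimuths x 3 elevations
```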
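
The Experiment Setup row quotes resizing RGB-D inputs to 320 × 480, randomly sampling 2 views from each multi-view scene for view-prediction training, and optimizing with Adam at a learning rate of 0.001. The sketch below shows one way those choices could be wired up in PyTorch; the function names (`resize_rgbd`, `sample_view_pair`) and the placeholder module standing in for D3DP-Nets are assumptions, not the authors' code.

```python
import random
import torch
import torch.nn.functional as F

# Assumption: each scene is a tensor of shape (num_views, 4, H, W)
# holding stacked RGB-D views.
TARGET_HW = (320, 480)  # RGB and depth images are resized to 320 x 480

def resize_rgbd(view: torch.Tensor) -> torch.Tensor:
    """Bilinearly resize a single (4, H, W) RGB-D view to the target resolution."""
    return F.interpolate(view.unsqueeze(0), size=TARGET_HW,
                         mode="bilinear", align_corners=False).squeeze(0)

def sample_view_pair(scene: torch.Tensor):
    """Randomly pick 2 of the scene's views, as described for view-prediction training."""
    i, j = random.sample(range(scene.shape[0]), 2)
    return resize_rgbd(scene[i]), resize_rgbd(scene[j])

# Optimizer matching the quoted setup: Adam with learning rate 0.001.
# `model` is a stand-in module; the real architecture is D3DP-Nets.
model = torch.nn.Conv2d(4, 8, kernel_size=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```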