Leveraging VLM-Based Pipelines to Annotate 3D Objects

Authors: Rishabh Kabra, Loïc Matthey, Alexander Lerchner, Niloy Mitra

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show our probabilistic aggregation is not only more reliable and efficient, but sets the SoTA on inferring object types with respect to human-verified labels. The aggregated annotations are also useful for conditional inference; they improve downstream predictions (e.g., of object material) when the object's type is specified as an auxiliary text-based input. Such auxiliary inputs allow ablating the contribution of visual reasoning over visionless reasoning in an unsupervised setting. With these supervised and unsupervised evaluations, we show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects from the Objaverse dataset.
Researcher Affiliation | Collaboration | 1Google DeepMind, 2University College London. Correspondence to: Rishabh Kabra <rkabra@google.com>.
Pseudocode | No | No explicitly labeled 'Pseudocode' or 'Algorithm' block was found. The aggregation method is described using mathematical equations (Eqs. 1, 2, 3) and textual explanation.
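The paper ships Python code for this score-based, multi-probe aggregation as supplementary material rather than a pseudocode block. As a rough illustration only, the sketch below pools scored responses across views by summing the exponentiated log-likelihoods of identical answer strings; the exact pooling rule is the one defined in the paper's Eqs. 1-3, and the function name and normalization below are our own assumptions.

```python
from collections import defaultdict
import math

def aggregate_responses(scored_responses):
    """Pool scored VLM responses across views into a single ranking.

    scored_responses: list of (text, log_score) pairs gathered over all
    I views x J beam candidates for one object. Summing exponentiated
    log-scores of identical strings is an illustrative assumption, not
    the paper's exact Eqs. 1-3.
    """
    pooled = defaultdict(float)
    for text, log_score in scored_responses:
        pooled[text.strip().lower()] += math.exp(log_score)
    total = sum(pooled.values()) or 1.0
    # Normalize so the aggregate reads as a distribution over answers.
    return sorted(((t, s / total) for t, s in pooled.items()),
                  key=lambda kv: kv[1], reverse=True)

# Example: two views, top-2 candidates each.
ranked = aggregate_responses([
    ("chair", -0.3), ("stool", -1.2),   # view 1
    ("chair", -0.5), ("bench", -1.6),   # view 2
])
print(ranked[0])  # most likely object type after pooling
```

Pooling likelihood mass across views, rather than taking a per-view argmax, is what allows a consistent answer to outvote occasional single-view mistakes.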
Open Source Code | Yes | Our salient contributions are the following. We: ... 5. Are releasing 5M aggregated captions and annotations for Objaverse. These are available via our project page. ... We attach the following files as supplementary materials: ... 5. Python code for our score-based, multi-probe aggregation method. 6. Code diff for BLIP-2 to highlight our minimal changes to generate outputs with scores and run in LLM mode.
Open Datasets | Yes | Dataset. Our main target is Objaverse 1.0 (Deitke et al., 2023), an internet-scale collection of 800K diverse but poorly annotated 3D models. They were uploaded by 100K artists to the Sketchfab platform. While the uploaded tags and descriptions are inconsistent and unreliable, a subset of 44K objects called Objaverse-LVIS is accompanied by human-verified categories. We rely on it to validate our semantic annotations. We also introduce a subset with material labels to test material inference.
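The human-verified Objaverse-LVIS categories the paper validates against can be fetched programmatically. The sketch below uses the community `objaverse` Python package, which the paper does not mention; it is only one convenient way to reproduce the UID-to-category mapping.

```python
# Minimal sketch, assuming the `objaverse` package (pip install objaverse);
# the paper does not state which tooling the authors used for data access.
import objaverse

lvis = objaverse.load_lvis_annotations()   # {LVIS category: [object UIDs]}
uid_to_category = {uid: cat for cat, uids in lvis.items() for uid in uids}

print(len(uid_to_category), "objects carry human-verified categories")

# Hypothetical usage: download a few GLB files for rendering.
# paths = objaverse.load_objects(uids=list(uid_to_category)[:4])
```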
Dataset Splits | No | The paper uses Objaverse-LVIS for validation and mentions a 'material test set' but does not specify explicit train/validation/test percentage splits or absolute sample counts for each split in the main text. It states 'We rely on it to validate our semantic annotations' regarding Objaverse-LVIS and 'We collect four sets of semantic descriptions for Objaverse: ... We compare outputs from these sources to human-verified object categories from Objaverse-LVIS.'
Hardware Specification | No | No specific hardware details (e.g., CPU, GPU models, memory, or cloud instance types) used for running the experiments were provided in the paper. It mentions 'PaLI-X 55B VQA' and 'BLIP-2 T5 XL' as models, implying computational resources were used, but gives no specifications.
Software Dependencies | Yes | We downloaded 798,759 Objaverse GLB files and rendered them using Blender 3.4 (Community, 2018). ... The model (Chen et al., 2023b) is based on the flaxformer transformer (Vaswani et al., 2017) library and the t5x training/evaluation infrastructure (Roberts et al., 2022), both written and released in JAX (Frostig et al., 2018). ... The language backbone relies on a SentencePiece tokenizer (Kudo & Richardson, 2018) with a 250K vocabulary, available here. ... As CAP3D did, we use BLIP-2 from LAVIS (Li et al., 2022a), which is based on the widely used PyTorch transformers library.
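For the BLIP-2 branch, the quoted dependency (LAVIS on PyTorch) can be exercised with the stock generation API. The sketch below is a plain LAVIS call, not the authors' patched version that also returns scores and runs in LLM mode; the image path and prompt are placeholders.

```python
# Minimal sketch of querying BLIP-2 T5 XL through LAVIS on a rendered view.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

# "render_view0.png" is a hypothetical output of the Blender rendering step.
image = vis_processors["eval"](Image.open("render_view0.png").convert("RGB"))
image = image.unsqueeze(0).to(device)

# num_captions > 1 loosely mirrors collecting J candidate responses per view.
captions = model.generate(
    {"image": image, "prompt": "Question: what type of object is this? Answer:"},
    use_nucleus_sampling=False, num_beams=5, num_captions=5,
)
print(captions)
```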
Experiment Setup | Yes | We applied ScoreAgg to summarize J = 5 responses across I = 8 views per object. ... This produced 4 sets of top-5 responses per view (I = 4 × 8 = 32, J = 5). ... PaLI scoring is length-normalized as originally described in Eq. 14 of (Wu et al., 2016) or coded in t5x here. This helps ensure that longer outputs are not disadvantaged. We kept the length-norm parameter fixed at α = 0.6 and used default beam-search sampling with 5 parallel decodings. ... We placed each object at the origin and scaled its maximum dimension to 1. We then rotated the camera at a fixed height and distance from the origin, rendering images at azimuthal intervals of 45 degrees. To determine the camera height, we swept over a few values of the polar angle θ w.r.t. the z-axis. We presented this sweep and other rendering hyperparameters (such as lighting conditions) in Table 4 in the main text.
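Two parts of this setup translate directly into small helpers: placing the camera on a sphere around the unit-normalized object at 45-degree azimuthal intervals, and the length normalization of Eq. 14 in Wu et al. (2016) with α = 0.6. The sketch below uses our own function names, and the radius and polar-angle values are illustrative placeholders rather than the swept values reported in Table 4.

```python
import math

def camera_position(azimuth_deg: float, polar_deg: float, radius: float):
    """Cartesian camera location on a sphere around the origin, with the
    object's maximum dimension scaled to 1 and z as the up axis."""
    az, pol = math.radians(azimuth_deg), math.radians(polar_deg)
    return (radius * math.sin(pol) * math.cos(az),
            radius * math.sin(pol) * math.sin(az),
            radius * math.cos(pol))

# Eight views at 45-degree azimuthal intervals; polar_deg and radius here
# are placeholders, not the paper's swept values.
views = [camera_position(a, polar_deg=60.0, radius=2.0) for a in range(0, 360, 45)]

def length_normalized_score(log_likelihood: float, length: int, alpha: float = 0.6) -> float:
    """Length penalty of Eq. 14 in Wu et al. (2016): lp = ((5 + |Y|)/6)^alpha."""
    return log_likelihood / (((5.0 + length) / 6.0) ** alpha)

# A longer answer with a slightly worse raw log-likelihood can outrank a
# shorter one once both are length normalized:
print(length_normalized_score(-2.0, length=1))  # ~ -2.00
print(length_normalized_score(-2.4, length=4))  # ~ -1.88
```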