Leveraging VLM-Based Pipelines to Annotate 3D Objects

Authors: Rishabh Kabra, Loïc Matthey, Alexander Lerchner, Niloy Mitra

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show our probabilistic aggregation is not only more reliable and efficient, but sets the SoTA on inferring object types with respect to human-verified labels. The aggregated annotations are also useful for conditional inference; they improve downstream predictions (e.g., of object material) when the object's type is specified as an auxiliary text-based input. Such auxiliary inputs allow ablating the contribution of visual reasoning over visionless reasoning in an unsupervised setting. With these supervised and unsupervised evaluations, we show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects from the Objaverse dataset.
Researcher Affiliation | Collaboration | 1Google DeepMind, 2University College London. Correspondence to: Rishabh Kabra <rkabra@google.com>.
Pseudocode | No | No explicitly labeled 'Pseudocode' or 'Algorithm' block was found. The aggregation method is described using mathematical equations (Eqs. 1, 2, 3) and textual explanation.
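The paper ships Python code for this score-based, multi-probe aggregation as supplementary material rather than a pseudocode block. As a rough illustration only, the sketch below pools scored responses across views by summing the exponentiated log-likelihoods of identical answer strings; the exact pooling rule is the one defined in the paper's Eqs. 1-3, and the function name and normalization below are our own assumptions.

```python
from collections import defaultdict
import math

def aggregate_responses(scored_responses):
    """Pool scored VLM responses across views into a single ranking.

    scored_responses: list of (text, log_score) pairs gathered over all
    I views x J beam candidates for one object. Summing exponentiated
    log-scores of identical strings is an illustrative assumption, not
    the paper's exact Eqs. 1-3.
    """
    pooled = defaultdict(float)
    for text, log_score in scored_responses:
        pooled[text.strip().lower()] += math.exp(log_score)
    total = sum(pooled.values()) or 1.0
    # Normalize so the aggregate reads as a distribution over answers.
    return sorted(((t, s / total) for t, s in pooled.items()),
                  key=lambda kv: kv[1], reverse=True)

# Example: two views, top-2 candidates each.
ranked = aggregate_responses([
    ("chair", -0.3), ("stool", -1.2),   # view 1
    ("chair", -0.5), ("bench", -1.6),   # view 2
])
print(ranked[0])  # most likely object type after pooling
```

Pooling likelihood mass across views, rather than taking a per-view argmax, is what allows a consistent answer to outvote occasional single-view mistakes.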
Open Source Code | Yes | Our salient contributions are the following. We: ... 5. Are releasing 5M aggregated captions and annotations for Objaverse. These are available via our project page. ... We attach the following files as supplementary materials: ... 5. Python code for our score-based, multi-probe aggregation method. 6. Code diff for BLIP-2 to highlight our minimal changes to generate outputs with scores and run in LLM mode.
Open Datasets | Yes | Dataset. Our main target is Objaverse 1.0 (Deitke et al., 2023), an internet-scale collection of 800K diverse but poorly annotated 3D models. They were uploaded by 100K artists to the Sketchfab platform. While the uploaded tags and descriptions are inconsistent and unreliable, a subset of 44K objects called Objaverse-LVIS is accompanied by human-verified categories. We rely on it to validate our semantic annotations. We also introduce a subset with material labels to test material inference.
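The human-verified Objaverse-LVIS categories the paper validates against can be fetched programmatically. The sketch below uses the community `objaverse` Python package, which the paper does not mention; it is only one convenient way to reproduce the UID-to-category mapping.

```python
# Minimal sketch, assuming the `objaverse` package (pip install objaverse);
# the paper does not state which tooling the authors used for data access.
import objaverse

lvis = objaverse.load_lvis_annotations()   # {LVIS category: [object UIDs]}
uid_to_category = {uid: cat for cat, uids in lvis.items() for uid in uids}

print(len(uid_to_category), "objects carry human-verified categories")

# Hypothetical usage: download a few GLB files for rendering.
# paths = objaverse.load_objects(uids=list(uid_to_category)[:4])
```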
Dataset Splits | No | The paper uses Objaverse-LVIS for validation and mentions a 'material test set' but does not specify explicit train/validation/test percentage splits or absolute sample counts for each split in the main text. It states 'We rely on it to validate our semantic annotations' regarding Objaverse-LVIS and 'We collect four sets of semantic descriptions for Objaverse: ... We compare outputs from these sources to human-verified object categories from Objaverse-LVIS.'
Hardware Specification | No | No specific hardware details (e.g., CPU, GPU models, memory, or cloud instance types) used for running the experiments were provided in the paper. It mentions 'PaLI-X 55B VQA' and 'BLIP-2 T5 XL' as models, implying computational resources were used, but gives no specifications.
Software Dependencies | Yes | We downloaded 798,759 Objaverse GLB files and rendered them using Blender 3.4 (Community, 2018). ... The model (Chen et al., 2023b) is based on the flaxformer transformer (Vaswani et al., 2017) library and the t5x training/evaluation infrastructure (Roberts et al., 2022), both written and released in JAX (Frostig et al., 2018). ... The language backbone relies on a SentencePiece tokenizer (Kudo & Richardson, 2018) with a 250K vocabulary, available here. ... As CAP3D did, we use BLIP-2 from LAVIS (Li et al., 2022a), which is based on the widely used PyTorch transformers library.
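For the BLIP-2 branch, the quoted dependency (LAVIS on PyTorch) can be exercised with the stock generation API. The sketch below is a plain LAVIS call, not the authors' patched version that also returns scores and runs in LLM mode; the image path and prompt are placeholders.

```python
# Minimal sketch of querying BLIP-2 T5 XL through LAVIS on a rendered view.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

# "render_view0.png" is a hypothetical output of the Blender rendering step.
image = vis_processors["eval"](Image.open("render_view0.png").convert("RGB"))
image = image.unsqueeze(0).to(device)

# num_captions > 1 loosely mirrors collecting J candidate responses per view.
captions = model.generate(
    {"image": image, "prompt": "Question: what type of object is this? Answer:"},
    use_nucleus_sampling=False, num_beams=5, num_captions=5,
)
print(captions)
```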
Experiment Setup | Yes | We applied ScoreAgg to summarize J = 5 responses across I = 8 views per object. ... This produced 4 sets of top-5 responses per view (I = 4 × 8 = 32, J = 5). ... PaLI scoring is length-normalized as originally described in Eq. 14 of (Wu et al., 2016) or coded in t5x here. This helps ensure that longer outputs are not disadvantaged. We kept the length-norm parameter fixed at α = 0.6 and used default beam-search sampling with 5 parallel decodings. ... We placed each object at the origin and scaled its maximum dimension to 1. We then rotated the camera at a fixed height and distance from the origin, rendering images at azimuthal intervals of 45 degrees. To determine the camera height, we swept over a few values of the polar angle θ w.r.t. the z-axis. We presented this sweep and other rendering hyperparameters (such as lighting conditions) in Table 4 in the main text.
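Two parts of this setup translate directly into small helpers: placing the camera on a sphere around the unit-normalized object at 45-degree azimuthal intervals, and the length normalization of Eq. 14 in Wu et al. (2016) with α = 0.6. The sketch below uses our own function names, and the radius and polar-angle values are illustrative placeholders rather than the swept values reported in Table 4.

```python
import math

def camera_position(azimuth_deg: float, polar_deg: float, radius: float):
    """Cartesian camera location on a sphere around the origin, with the
    object's maximum dimension scaled to 1 and z as the up axis."""
    az, pol = math.radians(azimuth_deg), math.radians(polar_deg)
    return (radius * math.sin(pol) * math.cos(az),
            radius * math.sin(pol) * math.sin(az),
            radius * math.cos(pol))

# Eight views at 45-degree azimuthal intervals; polar_deg and radius here
# are placeholders, not the paper's swept values.
views = [camera_position(a, polar_deg=60.0, radius=2.0) for a in range(0, 360, 45)]

def length_normalized_score(log_likelihood: float, length: int, alpha: float = 0.6) -> float:
    """Length penalty of Eq. 14 in Wu et al. (2016): lp = ((5 + |Y|)/6)^alpha."""
    return log_likelihood / (((5.0 + length) / 6.0) ** alpha)

# A longer answer with a slightly worse raw log-likelihood can outrank a
# shorter one once both are length normalized:
print(length_normalized_score(-2.0, length=1))  # ~ -2.00
print(length_normalized_score(-2.4, length=4))  # ~ -1.88
```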