Uni3D: Exploring Unified 3D Representation at Scale

Authors: Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, Xinlong Wang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We efficiently scale up Uni3D to one billion parameters, and set new records on a broad range of 3D tasks, such as zero-shot classification, few-shot classification, open-world understanding and part segmentation. We first evaluate Uni3D under the zero-shot shape classification task. We conduct experiments under three benchmarks: ModelNet (Wu et al., 2015), ScanObjNN (Uy et al., 2019) and Objaverse-LVIS (Deitke et al., 2023b).
Researcher Affiliation | Collaboration | Junsheng Zhou (1,2), Jinsheng Wang (1), Baorui Ma (1), Yu-Shen Liu (2), Tiejun Huang (1,3), Xinlong Wang (1); affiliations: 1 Beijing Academy of Artificial Intelligence, 2 Tsinghua University, 3 Peking University
Pseudocode | No | The paper does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present any structured steps formatted like code.
Open Source Code | Yes | Code & Models: https://github.com/baaivision/Uni3D
Open Datasets | Yes | In order to keep the experimental settings consistent with other methods for a fair comparison, we adopt the ensembled 3D dataset provided by OpenShape for training, which consists of four 3D datasets, i.e., Objaverse (Deitke et al., 2023b), ShapeNet (Chang et al., 2015), 3D-FUTURE (Fu et al., 2021) and ABO (Collins et al., 2022).
Dataset Splits | Yes | We follow the settings of OpenShape (Liu et al., 2023) to conduct evaluations. We conduct few-shot linear probing on the difficult Objaverse-LVIS dataset with 1, 2, 4, 8, and 16 labeled training samples per class. For Objaverse-LVIS, we use 10,000 sampled colored points as input. For ModelNet40, we utilize 10,000 sampled points without color as input. For ScanObjNN, the input is 2,048 sampled points without color from the OBJ_ONLY version.
Hardware Specification | Yes | Taking advantage of the aforementioned strategies, our largest model, i.e., Uni3D-g with one billion parameters, converges in approximately 20 hours with 24 NVIDIA-A100-SXM4-40GB GPUs.
Software Dependencies | No | The paper mentions software components like 'Adam optimizer', 'FLIP technique', and 'DeepSpeed', but it does not specify any version numbers for these or other software libraries (e.g., 'PyTorch 1.9', 'DeepSpeed 0.5.0').
Experiment Setup | Yes | We employ the Adam (Kingma & Ba, 2014) optimizer with a peak learning rate of 1e-3 that gradually decreases following a cosine learning rate schedule. To enhance training stability, we adopt stochastic depth (Huang et al., 2016) regularization. We also leverage the FLIP (Li et al., 2023b) technique, which randomly masks 50% of point tokens during training, reducing time complexity by half. We precache text and image CLIP embeddings of all shapes, allowing us to increase the total batch size to 1152 and greatly accelerating training. To further improve the training process, we adopt DeepSpeed (Rasley et al., 2020) with the ZeRO stage-1 optimizer and fp16 precision with dynamic loss scaling (Rajbhandari et al., 2020).
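
Several of the excerpts above translate naturally into code. The zero-shot classification protocol quoted under Research Type follows the CLIP recipe: each shape is assigned to the class whose text embedding lies closest to its point-cloud embedding. Below is a minimal sketch of that scoring step, assuming a hypothetical `point_encoder` and precomputed CLIP text embeddings; the names and tensor shapes are illustrative, not the authors' API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(point_encoder, text_features, points):
    """Assign each shape to the class whose CLIP text embedding is most
    similar (by cosine similarity) to its Uni3D point-cloud embedding.

    point_encoder : maps (B, N, 6) xyz+rgb point clouds to (B, D) features
    text_features : (C, D) precomputed CLIP embeddings of class-name prompts
    points        : (B, N, 6) batch of sampled point clouds
    """
    shape_feats = F.normalize(point_encoder(points), dim=-1)  # (B, D)
    text_feats = F.normalize(text_features, dim=-1)           # (C, D)
    logits = shape_feats @ text_feats.T                       # (B, C) cosine scores
    return logits.argmax(dim=-1)                              # predicted class indices
```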
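
Few-shot linear probing, as described under Dataset Splits, fits only a linear classifier on frozen Uni3D features, with k in {1, 2, 4, 8, 16} labeled shapes per class. A sketch of the probe under that assumption; the full-batch training loop, epoch count, and learning rate here are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn

def linear_probe(train_feats, train_labels, num_classes, epochs=100, lr=1e-2):
    """Fit a linear classifier on frozen, precomputed shape features.

    train_feats  : (k * num_classes, D) Uni3D embeddings of the k-shot set
    train_labels : (k * num_classes,) integer class labels
    """
    probe = nn.Linear(train_feats.shape[-1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(train_feats), train_labels)
        loss.backward()
        optimizer.step()
    return probe  # evaluate with probe(test_feats).argmax(dim=-1)
```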
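
Finally, two pieces of the Experiment Setup row map directly onto standard PyTorch components: Adam at a peak learning rate of 1e-3 decayed by a cosine schedule, and FLIP-style masking that keeps a random 50% of point tokens. Only the mask ratio, optimizer, and schedule come from the quote; everything else (warmup, module structure, DeepSpeed integration) is omitted or assumed in this sketch.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model, total_steps, peak_lr=1e-3):
    # Adam with a 1e-3 peak learning rate decayed by a cosine schedule,
    # matching the setup quoted above (no warmup shown here).
    optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler

def flip_mask_tokens(tokens, keep_ratio=0.5):
    """FLIP-style masking: keep a random 50% of point tokens per sample,
    roughly halving the transformer's training cost.

    tokens : (B, T, D) point-patch token embeddings
    """
    B, T, D = tokens.shape
    keep = int(T * keep_ratio)
    idx = torch.rand(B, T, device=tokens.device).argsort(dim=1)[:, :keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))  # (B, keep, D)
```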