Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
LabelAny3D: Label Any Object 3D in the Wild
Authors: Jin Yao, Radowan Mahmud Redoy, Sebastian Elbaum, Matthew Dwyer, Zezhou Cheng
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that annotations generated by LabelAny3D improve monocular 3D detection performance across multiple benchmarks, outperforming prior auto-labeling approaches in quality. These results demonstrate the promise of foundation-model-driven annotation for scaling up 3D recognition in realistic, open-world settings. |
| Researcher Affiliation | Academia | Jin Yao, Radowan Mahmud Redoy, Sebastian Elbaum, Matthew B. Dwyer, Zezhou Cheng; University of Virginia |
| Pseudocode | No | The paper describes the LabelAny3D pipeline in Section 3 and illustrates it in Figure 3, but it does not present structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We publicly release our code and data on our project page. https://uva-computer-vision-lab.github.io/LabelAny3D/ |
| Open Datasets | Yes | We introduce COCO3D, a new benchmark for open-vocabulary monocular 3D detection, derived from the MS-COCO dataset and covering a wide range of object categories absent from existing 3D datasets. The COCO3D benchmark comprises 2,039 human-refined images with a total of 5,373 instances spanning all 80 categories of the MS-COCO dataset [40]. We also assess the open-vocabulary detection capabilities of our trained model on Omni3D [7], which primarily encompasses indoor datasets such as SUN RGB-D [63], ARKitScenes [4], and Hypersim [58]; the object-centric dataset Objectron [2]; and autonomous driving datasets including nuScenes [8] and KITTI [24]. |
| Dataset Splits | Yes | The COCO3D benchmark comprises 2,039 human-refined images... We construct this benchmark using the validation set of the MS-COCO dataset. We curate a training set of 15,869 images from the MS-COCO [40] training split, annotated using our LabelAny3D pipeline without any human refinement. |
| Hardware Specification | Yes | Training takes approximately 48 hours on 4 NVIDIA A40 GPUs. |
| Software Dependencies | No | Our implementation is based on PyTorch3D [56] and Detectron2 [70]. Following [75], we use DINOv2-Base [50] as the image feature encoder and freeze its parameters during training. Specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | We train only the lifting head of OVMono3D [75] using ground-truth 2D bounding boxes. ... We train the model using SGD with an initial learning rate of 0.0012, which decays by a factor of 10 at 60% and 80% of training. A linear warm-up is applied for the first 1.8k steps. ... fine-tuned for 58k steps with a batch size of 64. |
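
The learning-rate schedule quoted in the Experiment Setup row can be sketched in plain Python to make the step boundaries concrete. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the 58k steps are the full schedule that the 60% and 80% decay points refer to, and the warm-up start factor of 0.001 is a hypothetical value that the quoted text does not specify. The function name `learning_rate` and its parameter names are likewise made up for this sketch.

```python
def learning_rate(
    step,
    base_lr=0.0012,             # initial learning rate quoted in the table
    total_steps=58_000,         # assumption: 58k steps is the whole schedule
    warmup_steps=1_800,         # linear warm-up length quoted in the table
    warmup_start_factor=0.001,  # assumption: not stated in the quoted text
    milestones=(0.6, 0.8),      # decay points as fractions of training
    gamma=0.1,                  # "decays by a factor of 10"
):
    """Return the learning rate at a 0-indexed training step."""
    # Step decay: apply gamma once for every milestone already reached.
    decays = sum(1 for m in milestones if step >= round(m * total_steps))
    lr = base_lr * gamma ** decays

    # Linear warm-up: scale from warmup_start_factor up to 1 over warmup_steps.
    if step < warmup_steps:
        alpha = step / warmup_steps
        lr *= warmup_start_factor * (1 - alpha) + alpha
    return lr


if __name__ == "__main__":
    # Print the rate at the start, end of warm-up, and around both decay points.
    for s in (0, 900, 1_800, 34_799, 34_800, 46_399, 46_400, 57_999):
        print(f"step {s:>6}: lr = {learning_rate(s):.3e}")
```

Under these assumptions the two decay points fall at steps 34,800 and 46,400. A training script built on Detectron2 would normally express the same schedule through its solver configuration rather than a hand-written function.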