Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Object Concepts Emerge from Motion

Authors: Haoqian Liang, Xiaohui Wang, Zhichao Li, Ya Yang, Naiyan Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our approach on three downstream tasks spanning both low-level (monocular depth estimation) and high-level (3D object detection and occupancy prediction) vision. Our models outperform previous supervised and self-supervised baselines and demonstrate strong generalization to unseen scenes. These results suggest that motion-induced object representations offer a compelling alternative to existing vision foundation models, capturing a crucial but overlooked level of abstraction: the visual instance. The implementation can be found here: https://github.com/yulemao/Object_Concepts_Emerge_from_Motion
Researcher Affiliation	Academia	Haoqian Liang¹, Xiaohui Wang¹, Zhichao Li, Ya Yang¹, Naiyan Wang. ¹Beijing University of Posts and Telecommunications. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode	Yes	A. Pseudo-codes for Pixel Cluster. Algorithm 1: Pixel Cluster
Open Source Code	Yes	The implementation can be found here: https://github.com/yulemao/Object_Concepts_Emerge_from_Motion
Open Datasets	Yes	We use two datasets in our approach: OpenDV-YouTube [67] and nuPlan [20]. Both datasets provide a large amount of high-quality and diverse unlabeled video data. OpenDV-YouTube contains videos collected from more than 244 cities all over the world, resulting in a total of 1747 hours of front-view videos. nuPlan provides 8 different camera views. It collects 1200 hours of driving data from 4 cities, 120 hours of which were recorded with 8 different camera views. We merged the two datasets and obtained approximately 2,700 hours of raw video data in total.
Dataset Splits	Yes	We evaluate our model on the KITTI dataset [16] using the standard Eigen split [15], with DCDepth [57] as the decoder. As shown in Tab. 1, our model consistently outperforms both supervised ImageNet-22K pretraining and models pretrained on the Semantic-SAM [30], which is a weakly supervised method utilizing large-scale pseudo segmentation annotations.
Hardware Specification	No	All training and test details are specified in Sec. 4. [Justification from NeurIPS Paper Checklist, but Section 4 only details models and training parameters, not specific hardware like GPU/CPU models.]
Software Dependencies	No	We implement the proposed method using PyTorch [43] and mmPretrain [11]. We train models on Swin Transformer [35] (Tiny to Large) and ResNet-50 [21].
Experiment Setup	Yes	We implement the proposed method using PyTorch [43] and mmPretrain [11]. We train models on Swin Transformer [35] (Tiny to Large) and ResNet-50 [21]. All Swin models use a window size of 7, while the B and L variants of SimMIM [64] and Semantic-SAM [30], which are used for comparison, adopt a larger window size of 12. This larger window is usually beneficial due to the increased context, at the expense of higher computational cost. AdamW optimizer [37] with a weight decay of 0.05 is adopted. All models are trained for 200 epochs using a cosine decay learning rate scheduler and 10 epochs of linear warm-up. The initial learning rate is set to 0.001 and batch size is set to 2048. All input images are cropped and resized to a resolution of 224 × 224. We employ a data augmentation strategy that includes random flipping, brightness, and gamma adjustment. We sample 200 labeled pixels from each image for training. We further fine-tune the models for 20 epochs with an initial learning rate of 2 × 10⁻⁵ and a weight decay of 10⁻⁴. During fine-tuning, two random crops are extracted from each input image, and the loss is calculated both within each crop and between the two. This fine-tuning process further enhances the separation of distant objects in large images. All downstream models are trained with official open-sourced code for comparison. During fine-tuning on downstream tasks, only the pretrained weights of the backbone are utilized for a fair comparison.
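
The pretraining schedule quoted in the Experiment Setup row (200 epochs, 10 epochs of linear warm-up, cosine decay, initial learning rate 0.001) can be illustrated with a small sketch. This is not the authors' implementation: the per-epoch granularity, the warm-up ramp that reaches the base rate at the last warm-up epoch, the final learning rate of zero, and the function name `lr_at_epoch` are all assumptions made for illustration. Only the numeric constants come from the quoted text.

```python
import math

# Values quoted in the Experiment Setup row.
BASE_LR = 1e-3        # initial learning rate
TOTAL_EPOCHS = 200    # pretraining epochs
WARMUP_EPOCHS = 10    # linear warm-up epochs
WEIGHT_DECAY = 0.05   # AdamW weight decay (listed for reference, unused below)
BATCH_SIZE = 2048     # batch size (listed for reference, unused below)

# Assumption (not stated in the source): the cosine decays to zero.
MIN_LR = 0.0


def lr_at_epoch(epoch: int) -> float:
    """Return the learning rate for a 0-indexed epoch (hypothetical helper).

    Assumes per-epoch updates: a linear ramp that reaches BASE_LR at the
    last warm-up epoch, followed by a half-cosine decay towards MIN_LR.
    """
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError(f"epoch must be in [0, {TOTAL_EPOCHS})")
    if epoch < WARMUP_EPOCHS:
        # Linear warm-up: 1/10, 2/10, ..., 10/10 of the base rate.
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # Fraction of the post-warm-up phase that has elapsed, in [0, 1).
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for e in (0, 9, 10, 105, 199):
        print(f"epoch {e:3d}: lr = {lr_at_epoch(e):.6g}")
```

Under these assumptions the rate rises linearly over the first 10 epochs, peaks at 0.001, and then falls smoothly towards zero by epoch 200. A real training run would more likely update the rate per iteration and might use a non-zero floor; the released repository is the authoritative reference for those details.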