Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Object Concepts Emerge from Motion

Authors: Haoqian Liang, Xiaohui Wang, Zhichao Li, Ya Yang, Naiyan Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our approach on three downstream tasks spanning both low-level (monocular depth estimation) and high-level (3D object detection and occupancy prediction) vision. Our models outperform previous supervised and self-supervised baselines and demonstrate strong generalization to unseen scenes. These results suggest that motion-induced object representations offer a compelling alternative to existing vision foundation models, capturing a crucial but overlooked level of abstraction: the visual instance. The implementation can be found here: https://github.com/yulemao/Object_Concepts_Emerge_from_Motion
Researcher Affiliation Academia Haoqian Liang¹, Xiaohui Wang¹, Zhichao Li, Ya Yang¹, Naiyan Wang; ¹Beijing University of Posts and Telecommunications
Pseudocode Yes A. Pseudo-codes for Pixel Cluster; Algorithm 1: Pixel Cluster
Open Source Code Yes The implementation can be found here: https://github.com/yulemao/Object_Concepts_Emerge_from_Motion
Open Datasets Yes We use two datasets in our approach: OpenDV-YouTube [67] and nuPlan [20]. Both datasets provide a large amount of high-quality and diverse unlabeled video data. OpenDV-YouTube contains videos collected from more than 244 cities all over the world, resulting in a total of 1747 hours of front-view videos. nuPlan provides 8 different camera views. It collects 1200 hours of driving data from 4 cities, 120 hours of which were recorded with 8 different camera views. We merged the two datasets and obtained approximately 2,700 hours of raw video data in total.
Dataset Splits Yes We evaluate our model on the KITTI dataset [16] using the standard Eigen split [15], with DCDepth [57] as the decoder. As shown in Tab. 1, our model consistently outperforms both supervised ImageNet-22K pretraining and models pretrained on the Semantic-SAM [30], which is a weakly supervised method utilizing large-scale pseudo segmentation annotations.
Hardware Specification No All training and test details are specified in Sec. 4. [Justification from NeurIPS Paper Checklist, but Section 4 only details models and training parameters, not specific hardware like GPU/CPU models.]
Software Dependencies No We implement the proposed method using PyTorch [43] and mmPretrain [11]. We train models on Swin Transformer [35] (Tiny to Large) and ResNet-50 [21].
Experiment Setup Yes We implement the proposed method using PyTorch [43] and mmPretrain [11]. We train models on Swin Transformer [35] (Tiny to Large) and ResNet-50 [21]. All Swin models use a window size of 7, while the B and L variants of SimMIM [64] and Semantic-SAM [30], which are used for comparison, adopt a larger window size of 12. This larger window is usually beneficial due to the increased context, at the expense of higher computational cost. AdamW optimizer [37] with a weight decay of 0.05 is adopted. All models are trained for 200 epochs using a cosine decay learning rate scheduler and 10 epochs of linear warm-up. The initial learning rate is set to 0.001 and batch size is set to 2048. All input images are cropped and resized to a resolution of 224 × 224. We employ a data augmentation strategy that includes random flipping, brightness, and gamma adjustment. We sample 200 labeled pixels from each image for training. We further fine-tune the models for 20 epochs with an initial learning rate of 2 × 10⁻⁵ and a weight decay of 10⁻⁴. During fine-tuning, two random crops are extracted from each input image, and the loss is calculated both within each crop and between the two. This fine-tuning process further enhances the separation of distant objects in large images. All downstream models are trained with official open-sourced code for comparison. During fine-tuning on downstream tasks, only the pretrained weights of the backbone are utilized for a fair comparison.
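
Illustrative sketch (not taken from the paper or its repository): the pretraining schedule quoted in the Experiment Setup row, 200 epochs with 10 epochs of linear warm-up followed by cosine decay from an initial learning rate of 0.001, could be written roughly as below. The function name `lr_at_epoch`, the per-epoch (rather than per-iteration) stepping, the warm-up starting point, and the final learning rate of 0 are assumptions made for illustration; the authors' actual scheduler settings may differ.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=10, total_epochs=200, min_lr=0.0):
    """Hypothetical per-epoch learning rate: linear warm-up, then cosine decay.

    base_lr, warmup_epochs and total_epochs follow the values quoted above;
    min_lr and the per-epoch granularity are assumptions.
    """
    if epoch < warmup_epochs:
        # Linear warm-up: ramp from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs, from base_lr down to min_lr.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the schedule at a few representative epochs.
    for e in (0, 5, 9, 10, 100, 199):
        print(f"epoch {e:3d}: lr = {lr_at_epoch(e):.6f}")
```

In a PyTorch training loop, a function of this kind would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR` (returning the ratio to the base learning rate) around an `AdamW` optimizer with weight decay 0.05; that wiring is omitted here to keep the sketch free of dependencies.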