Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MetaSlot: Break Through the Fixed Number of Slots in Object-Centric Learning

Authors: Hongjia Liu, Rongzhen Zhao, Haohan Chen, Joni K. Pajarinen

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments across diverse vision tasks and datasets show that MetaSlot yields substantial improvements on key metrics and exhibits strong adaptability to a wide range of scenes. The paper includes a dedicated section '4 Experiments' covering object discovery, set prediction, interpretability analysis, and ablation studies, presenting numerous tables and figures with performance metrics. |
| Researcher Affiliation | Academia | (1) Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland; (2) Department of Computer Science, Sichuan University, Chengdu, China |
| Pseudocode | Yes | In addition, we include pseudocode in Appendix A to provide additional implementation details. [...] Algorithm 1: MetaSlot. |
| Open Source Code | Yes | The code is available at https://github.com/lhj-lhj/MetaSlot. |
| Open Datasets | Yes | Datasets: We include both synthetic and real-world datasets. ClevrTex [67] comprises synthetic images, each with about 10 geometric objects scattered in complex backgrounds. MS COCO 2017 [68] is a recognized real-world image dataset, and we use its challenging panoptic segmentation and instance-level object annotations. PASCAL VOC 2012 [69] is a real-world image dataset, and we use its instance segmentation. We also report results on the real-world video dataset HQ-YTVIS [70], which contains large-scale short videos from YouTube. |
| Dataset Splits | No | The paper mentions using several publicly available datasets, namely ClevrTex, MS COCO 2017, PASCAL VOC 2012, and HQ-YTVIS. However, it does not explicitly state the training, validation, and test splits (e.g., 80/10/10%) or provide the per-split sample counts needed to reproduce the data partitioning. |
| Hardware Specification | Yes | Every model, including both the baselines and our variants augmented with MetaSlot, was trained for 50k steps with the Adam optimizer [71] on a single NVIDIA V100 GPU using 16-bit mixed precision and a batch size of 32; |
| Software Dependencies | No | To ensure a fair comparison, we re-implemented all baseline models from scratch rather than relying on publicly reported results. Throughout all experiments, we kept data augmentation strategies, the visual feed-forward module (VFM) in the OCL encoder based on DINOv2 ViT-s/14 [50] and all training hyperparameters identical to those reported in the original papers. Furthermore, we replaced each model's original variational autoencoder (VAE) component with a large-scale pre-trained TAESD module [78], which is based on Stable Diffusion. The paper does not explicitly state software versions for Python, PyTorch, CUDA, or other key libraries required for the experimental environment. |
| Experiment Setup | Yes | All experiments share identical data augmentation pipelines and use the DINOv2 ViT (s/14) [50] as the OCL encoder, with matched training hyperparameters. Every model, including both the baselines and our variants augmented with MetaSlot, was trained for 50k steps with the Adam optimizer [71] on a single NVIDIA V100 GPU using 16-bit mixed precision and a batch size of 32; the MetaSlot codebook size was fixed to 512 throughout. [...] We set the initial learning rate to 2 × 10^-4 and maintained it throughout training. For the MetaSlot module, the codebook size was fixed at 512, the feature map resolution at 256 × 256, and the embedding dimension at 256. The number of slots for each model remained consistent with its original baseline configuration. |
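
For readers attempting a reproduction, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The following is a minimal sketch, not the authors' code: the class name, field names, and helper methods are hypothetical, and only the values are taken from the quoted text. Settings the report does not state (dataset splits, software versions, number of slots) are deliberately left out.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    """Training settings quoted in the Experiment Setup row.

    Field names are hypothetical; only the values come from the report.
    """
    encoder: str = "DINOv2 ViT-S/14"      # OCL encoder backbone
    optimizer: str = "Adam"
    learning_rate: float = 2e-4           # kept constant throughout training
    train_steps: int = 50_000
    batch_size: int = 32
    precision: str = "16-bit mixed"
    num_gpus: int = 1                     # single NVIDIA V100
    codebook_size: int = 512              # MetaSlot codebook
    embed_dim: int = 256                  # MetaSlot embedding dimension
    feature_map_resolution: tuple = (256, 256)

    def lr_at(self, step: int) -> float:
        """Return the learning rate at a step (constant, no schedule reported)."""
        if not 0 <= step < self.train_steps:
            raise ValueError(f"step must be in [0, {self.train_steps})")
        return self.learning_rate

    def samples_seen(self) -> int:
        """Total training samples processed: steps times batch size."""
        return self.train_steps * self.batch_size


config = TrainConfig()
print(asdict(config))
print(config.samples_seen())
```

Under these settings a full run processes 50,000 × 32 = 1,600,000 samples per model. A frozen dataclass is used here so the reported values cannot be changed by accident between baseline and MetaSlot runs, which matches the report's statement that hyperparameters were matched across models.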