Bridging the Gap to Real-World Object-Centric Learning
Authors: Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, Francesco Locatello
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Broadly, we pursue two goals with our experiments: 1) demonstrating that our approach significantly extends the capabilities of object-centric models towards real-world applicability (Sec. 4.1), and 2) showing that our approach is competitive with more complex methods from the computer vision literature (Sec. 4.2). Additionally, we ablate key model components to find what is driving the success of our method (Sec. 4.3). The main task we consider in this work is object discovery, that is, finding pixel masks for all object instances in an image. |
| Researcher Affiliation | Collaboration | ¹Max-Planck Institute for Intelligent Systems, Tübingen, Germany; ²Amazon Web Services; ³Department of Computer Science, ETH Zürich |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code will be made available under https://github.com/amazon-science/object-centric-learning-framework. |
| Open Datasets | Yes | All datasets used in this work (MOVi, PASCAL VOC 2012, COCO, KITTI) are public and can be obtained on their respective web pages. |
| Dataset Splits | Yes | MOVi-C: validation split, 6,000 images; MOVi-E: validation split, 6,000 images; VOC 2012: validation split, 1,449 images; COCO 2017: validation split, 5,000 images — all validation splits with instance segmentation labels |
| Hardware Specification | Yes | The models were trained on 8 NVIDIA V100 GPUs with a local batch size of 8, with 16-bit mixed precision. |
| Software Dependencies | Yes | We use the Vision Transformer implementation provided by the timm library (Wightman, 2019). |
| Experiment Setup | Yes | We train DINOSAUR using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 4×10⁻⁴, linear learning rate warm-up of 10,000 optimization steps and an exponentially decaying learning rate schedule. Further, we clip the gradient norm at 1 in order to stabilize training and train for 500k steps for the MOVi and COCO datasets and 250k steps for PASCAL VOC. The models were trained on 8 NVIDIA V100 GPUs with a local batch size of 8, with 16-bit mixed precision. For the experiments on synthetic data, we use a ViT with patch size 8 and the MLP decoder. For the experiments on real-world data, we use a ViT with patch size 16 and the Transformer decoder. We analyze the impact of different decoders in Sec. 4.3. The main results are averaged over 5 random seeds; other experiments use 3 seeds. Further implementation details can be found in App. E.1. |
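The Software Dependencies row above notes that the paper builds on the Vision Transformer implementation from timm. The sketch below shows how such a ViT backbone could be loaded and used to extract patch features; the specific model name and weight tag (`vit_base_patch16_224.dino`) are assumptions for illustration, since the paper does not pin an exact timm model string and timm's naming for DINO weights varies across versions.

```python
# Hedged sketch: loading a ViT-B/16 backbone via timm and extracting patch
# tokens. Model name and pretraining tag are assumptions, not taken from
# the paper.
import timm
import torch

backbone = timm.create_model("vit_base_patch16_224.dino", pretrained=True)
backbone.eval()

images = torch.randn(2, 3, 224, 224)  # dummy batch of two 224x224 RGB images
with torch.no_grad():
    tokens = backbone.forward_features(images)  # (B, 1 + N, D): CLS + patch tokens
patch_features = tokens[:, 1:]  # drop the CLS token -> (B, 196, 768) for /16 at 224px
print(patch_features.shape)
```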
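The Experiment Setup row describes the optimization recipe: Adam at 4×10⁻⁴, a 10,000-step linear warm-up, an exponentially decaying schedule, and gradient-norm clipping at 1. A minimal PyTorch sketch of that schedule follows; the decay constants (`decay_rate`, `decay_steps`) and the stand-in model and loss are assumptions, as the quoted excerpt does not state them.

```python
# Minimal sketch of the reported training schedule, assuming a particular
# decay rate/horizon (not given in the excerpt) and a dummy model and loss.
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the DINOSAUR model
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)

warmup_steps = 10_000
total_steps = 500_000   # MOVi and COCO; the paper uses 250k for PASCAL VOC
decay_rate = 0.5        # ASSUMPTION: decay factor per decay horizon
decay_steps = 100_000   # ASSUMPTION: decay horizon in steps

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps  # linear warm-up from 0 to the base lr
    return decay_rate ** ((step - warmup_steps) / decay_steps)  # exponential decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    loss = model(torch.randn(8, 768)).pow(2).mean()  # dummy loss, local batch size 8
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip grad norm at 1
    optimizer.step()
    scheduler.step()
```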