Bridging the Gap to Real-World Object-Centric Learning
Authors: Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, Francesco Locatello
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Broadly, we pursue two goals with our experiments: 1) demonstrating that our approach significantly extends the capabilities of object-centric models towards real-world applicability (Sec. 4.1), and 2) showing that our approach is competitive with more complex methods from the computer vision literature (Sec. 4.2). Additionally, we ablate key model components to find what is driving the success of our method (Sec. 4.3). The main task we consider in this work is object discovery, that is, finding pixel masks for all object instances in an image. |
| Researcher Affiliation | Collaboration | ¹Max-Planck Institute for Intelligent Systems, Tübingen, Germany; ²Amazon Web Services; ³Department of Computer Science, ETH Zürich |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code will be made available under https://github.com/amazon-science/object-centric-learning-framework. |
| Open Datasets | Yes | All datasets used in this work (MOVi, PASCAL VOC 2012, COCO, KITTI) are public and can be obtained on their respective web pages. |
| Dataset Splits | Yes | MOVi-C: validation split, 6,000 images; MOVi-E: validation split, 6,000 images; VOC 2012: validation split, 1,449 images; COCO 2017: validation split, 5,000 images — all validation splits with instance segmentation labels |
| Hardware Specification | Yes | The models were trained on 8 NVIDIA V100 GPUs with a local batch size of 8, with 16-bit mixed precision. |
| Software Dependencies | Yes | We use the Vision Transformer implementation provided by the timm library (Wightman, 2019). |
| Experiment Setup | Yes | We train DINOSAUR using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 4×10⁻⁴, linear learning rate warm-up of 10,000 optimization steps and an exponentially decaying learning rate schedule. Further, we clip the gradient norm at 1 in order to stabilize training and train for 500k steps for the MOVi and COCO datasets and 250k steps for PASCAL VOC. The models were trained on 8 NVIDIA V100 GPUs with a local batch size of 8, with 16-bit mixed precision. For the experiments on synthetic data, we use a ViT with patch size 8 and the MLP decoder. For the experiments on real-world data, we use a ViT with patch size 16 and the Transformer decoder. We analyze the impact of different decoders in Sec. 4.3. The main results are averaged over 5 random seeds; other experiments use 3 seeds. Further implementation details can be found in App. E.1. |
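The Software Dependencies row above notes that the paper builds on the Vision Transformer implementation from timm. The sketch below shows how such a ViT backbone could be loaded and used to extract patch features; the specific model name and weight tag (`vit_base_patch16_224.dino`) are assumptions for illustration, since the paper does not pin an exact timm model string and timm's naming for DINO weights varies across versions.

```python
# Hedged sketch: loading a ViT-B/16 backbone via timm and extracting patch
# tokens. Model name and pretraining tag are assumptions, not taken from
# the paper.
import timm
import torch

backbone = timm.create_model("vit_base_patch16_224.dino", pretrained=True)
backbone.eval()

images = torch.randn(2, 3, 224, 224)  # dummy batch of two 224x224 RGB images
with torch.no_grad():
    tokens = backbone.forward_features(images)  # (B, 1 + N, D): CLS + patch tokens
patch_features = tokens[:, 1:]  # drop the CLS token -> (B, 196, 768) for /16 at 224px
print(patch_features.shape)
```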
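The Experiment Setup row describes the optimization recipe: Adam at 4×10⁻⁴, a 10,000-step linear warm-up, an exponentially decaying schedule, and gradient-norm clipping at 1. A minimal PyTorch sketch of that schedule follows; the decay constants (`decay_rate`, `decay_steps`) and the stand-in model and loss are assumptions, as the quoted excerpt does not state them.

```python
# Minimal sketch of the reported training schedule, assuming a particular
# decay rate/horizon (not given in the excerpt) and a dummy model and loss.
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the DINOSAUR model
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)

warmup_steps = 10_000
total_steps = 500_000   # MOVi and COCO; the paper uses 250k for PASCAL VOC
decay_rate = 0.5        # ASSUMPTION: decay factor per decay horizon
decay_steps = 100_000   # ASSUMPTION: decay horizon in steps

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps  # linear warm-up from 0 to the base lr
    return decay_rate ** ((step - warmup_steps) / decay_steps)  # exponential decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    loss = model(torch.randn(8, 768)).pow(2).mean()  # dummy loss, local batch size 8
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip grad norm at 1
    optimizer.step()
    scheduler.step()
```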