Bootstrapping Top-down Information for Self-modulating Slot Attention

Authors: Dongwon Kim, Seoyeon Kim, Suha Kwak

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments
Researcher Affiliation | Academia | Dept. of CSE, POSTECH; Graduate School of AI, POSTECH; {kdwon, syeonkim07, suha.kwak}@postech.ac.kr
Pseudocode | Yes | Algorithm 1: Self-modulating Slot Attention.
Open Source Code | No | "Code will be made public for open access after publication."
Open Datasets | Yes | "To verify the proposed method in diverse settings, including synthetic and authentic datasets, we considered four object-centric learning benchmarks: MOVI-C [15], MOVI-E [15], PASCAL VOC 2012 [14], and MS COCO 2017 [29]."
Dataset Splits | Yes | "MOVI-C contains 87,633 images for training and 6,000 images for evaluation, while MOVI-E contains 87,741 and 6,000, respectively. ... For the evaluation, we use the validation split containing 1,449 images. The COCO dataset consists of 118,287 training images and 5,000 images for evaluation."
Hardware Specification | Yes | "Full training of the model takes 26 hours using a single NVIDIA RTX 3090 GPU."
Software Dependencies | No | The paper mentions software components such as DINO, ViT, the Adam optimizer, a GRU, an MLP, and an autoregressive transformer decoder, but provides no version numbers for these dependencies and does not name the programming language used.
Experiment Setup | Yes | "The model is trained using an Adam optimizer [26] with an initial learning rate of 0.0004, while the encoder parameters are not trained. The number of slots K is set to 11, 24, 7, and 6 for MOVI-C, MOVI-E, COCO, and VOC, respectively. The codebook size E is set to 128 for synthetic datasets (MOVI-C and MOVI-E) and 512 for authentic datasets (COCO and VOC). The model is trained for 250K iterations on VOC and for 500K iterations on the others."
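The "Pseudocode" row confirms the paper ships Algorithm 1 (Self-modulating Slot Attention), but the algorithm itself is not reproduced in this report. As a point of reference, the following is a minimal NumPy sketch of the vanilla slot-attention iteration the method builds on (Locatello et al., 2020): it omits the paper's self-modulation and codebook entirely, and replaces the original GRU/MLP slot update with a plain attention-weighted mean. All names here are illustrative; this is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention(inputs, n_slots=7, n_iters=3, seed=0):
    """Simplified slot-attention pass (sketch only, no learned projections).

    inputs: (n_tokens, dim) array of encoder features.
    Returns the final slots (n_slots, dim) and the last attention map
    (n_tokens, n_slots). The softmax is taken over the SLOT axis, which
    is what makes slots compete for input tokens.
    """
    n_tokens, dim = inputs.shape
    rng = np.random.default_rng(seed)
    slots = rng.normal(size=(n_slots, dim))  # random slot initialization
    attn = None
    for _ in range(n_iters):
        # Dot-product attention, normalized across slots (competition).
        logits = inputs @ slots.T / np.sqrt(dim)      # (n_tokens, n_slots)
        attn = softmax(logits, axis=1)
        # Weighted mean of inputs per slot (weights renormalized over tokens);
        # the original algorithm feeds this update through a GRU + MLP instead.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ inputs                    # (n_slots, dim)
    return slots, attn
```

The per-token softmax over the slot axis is the defining design choice: each input token distributes its attention mass across slots, so slots specialize on disjoint regions, which is what the paper's top-down modulation then refines.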
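The per-dataset hyperparameters quoted in the "Experiment Setup" row can be collected into a single configuration sketch for reproduction attempts. The values (slot count K, codebook size E, learning rate, iteration budgets, frozen encoder) are those reported in the paper; every key name and the dict layout itself are hypothetical, since the code had not been released at reporting time.

```python
# Per-dataset settings reported in the paper; key names are illustrative.
CONFIGS = {
    "movi_c": {"num_slots": 11, "codebook_size": 128, "iterations": 500_000},
    "movi_e": {"num_slots": 24, "codebook_size": 128, "iterations": 500_000},
    "coco":   {"num_slots": 7,  "codebook_size": 512, "iterations": 500_000},
    "voc":    {"num_slots": 6,  "codebook_size": 512, "iterations": 250_000},
}

# Settings shared across datasets, per the quoted setup.
COMMON = {
    "optimizer": "Adam",
    "learning_rate": 4e-4,   # initial learning rate 0.0004
    "train_encoder": False,  # "the encoder parameters are not trained"
}


def get_config(dataset: str) -> dict:
    """Merge shared and dataset-specific settings (hypothetical helper)."""
    return {**COMMON, **CONFIGS[dataset]}
```

Note the pattern the numbers suggest: the synthetic MOVi datasets use the smaller codebook (E = 128) while the real-image datasets use E = 512, and only VOC uses the shorter 250K-iteration schedule.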