Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Attention Normalization Impacts Cardinality Generalization in Slot Attention
Authors: Markus Krimmel, Jan Achterhold, Joerg Stueckler
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we demonstrate that design decisions on normalizing the aggregated values in the attention architecture have a considerable impact on the ability of Slot Attention to generalize to a higher number of slots and objects than seen during training. We propose and investigate alternatives to the original normalization scheme which increase the generalization capabilities of Slot Attention to varying slot and object counts, resulting in performance gains on the task of unsupervised image segmentation. The newly proposed normalizations are minimal, easy-to-implement modifications of the usual Slot Attention module, changing the value aggregation mechanism from a weighted mean operation to a scaled weighted sum operation. Section 5: Experiments. |
| Researcher Affiliation | Academia | Markus Krimmel (EMAIL), Embodied Vision Group, Max Planck Institute for Intelligent Systems; Jan Achterhold (EMAIL), Embodied Vision Group, Max Planck Institute for Intelligent Systems; Joerg Stueckler (EMAIL), Embodied Vision Group, Max Planck Institute for Intelligent Systems, and Intelligent Perception in Technical Systems Group, University of Augsburg |
| Pseudocode | Yes | G Pseudocode: In Algorithms 1 and 2, we illustrate how the weighted sum and batch norm variants differ from the weighted mean variant in pseudo-PyTorch code. We illustrate this in a diff format (Algorithm 1: Diff of Weighted Sum Variant; Algorithm 2: Diff of Batch Norm Variant). |
| Open Source Code | Yes | Code is available at https://github.com/EmbodiedVision/slot_attention_normalization. |
| Open Datasets | Yes | We investigate the proposed normalizations on unsupervised object discovery tasks. To this end, we train autoencoders on the CLEVR (Johnson et al., 2017) and MOVi-C (Greff et al., 2022) datasets, utilizing autoencoder architectures that have been described in (Locatello et al., 2020) and (Seitzer et al., 2023), respectively. |
| Dataset Splits | Yes | The CLEVR dataset ... It consists of 100,000 2D renderings... Following Locatello et al. (2020), we use 70,000 images for training and further adopt the approach of (Locatello et al., 2020; Greff et al., 2019; Burgess et al., 2019) by cropping the images to highlight objects in the center. MOVi-C represents a significant step-up in perceptual complexity. It contains 10,986 video sequences... We use 250 of these video sequences for validation and hold out 999 sequences for testing. |
| Hardware Specification | No | The paper does not mention specific hardware details such as GPU models, CPU types, or memory amounts used for the experiments. It only describes the model architectures and training procedures. |
| Software Dependencies | No | The paper mentions 'Adam (Kingma & Ba, 2015) optimizer' and 'timm (Wightman, 2019) repository' but does not specify version numbers for these or other key software libraries like PyTorch. |
| Experiment Setup | Yes | We closely follow the training procedure of (Locatello et al., 2020). Namely, we train the autoencoder with an ℓ2 reconstruction loss, utilize 3 Slot Attention iterations during training, and use an Adam (Kingma & Ba, 2015) optimizer. The models are trained for 500,000 steps. Like the authors of (Locatello et al., 2020), we linearly warm up the learning rate over the course of the first 10,000 steps, after which it attains a peak value of 4 · 10⁻⁴. Subsequently, we decay it over the course of the remaining steps, with a half-life of 100,000 steps. We use a batch size of 64. |
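The table above notes that the paper's modification changes the slot update from a weighted mean to a scaled weighted sum over the attended values. The paper's exact formulations are given in its Algorithms 1 and 2; the sketch below is a minimal illustration of the two aggregation styles, where the `train_slots` parameter and the particular scale factor are illustrative assumptions, not the authors' exact scheme:

```python
import torch

def aggregate_values(attn, values, variant="weighted_mean", train_slots=7):
    """Aggregate input values into slots from attention weights.

    attn:   (batch, num_slots, num_inputs), softmaxed over the slot axis
    values: (batch, num_inputs, dim)
    """
    if variant == "weighted_mean":
        # Original Slot Attention: renormalize the weights per slot,
        # so each slot receives a weighted *mean* of the input values.
        weights = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
    elif variant == "weighted_sum":
        # Scaled weighted sum: skip the per-slot renormalization and
        # apply a global scale instead (illustrative choice here:
        # ratio of training-time to current slot count).
        weights = attn * (train_slots / attn.shape[1])
    else:
        raise ValueError(f"unknown variant: {variant}")
    return torch.einsum("bsn,bnd->bsd", weights, values)
```

Because the weighted-sum variant avoids dividing by a per-slot total that shrinks as slots are added, its output magnitudes are less sensitive to the slot count, which is the intuition behind the reported cardinality generalization gains.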
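The experiment-setup row describes a learning-rate schedule with linear warmup over 10,000 steps to a peak of 4 · 10⁻⁴, followed by exponential decay with a half-life of 100,000 steps. A minimal sketch of such a schedule follows; whether the decay clock starts at step 0 or after warmup is an assumption here (this sketch starts it after warmup):

```python
def learning_rate(step, peak_lr=4e-4, warmup_steps=10_000, half_life=100_000):
    """Linear warmup to peak_lr, then exponential decay with the given half-life."""
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup period.
        return peak_lr * step / warmup_steps
    # Halve the learning rate every `half_life` steps after warmup.
    return peak_lr * 0.5 ** ((step - warmup_steps) / half_life)
```

For example, the rate reaches 4e-4 at step 10,000 and falls to 2e-4 by step 110,000 under this convention.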