SWAT: Spatial Structure Within and Among Tokens
Authors: Kumara Kahatapitiya, Michael S. Ryoo
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our family of models, SWAT, on image classification and semantic segmentation. We use ImageNet-1K [Deng et al., 2009] and ADE20K [Zhou et al., 2019] as benchmarks to compare against common Transformer/Mixer/Conv architectures such as DeiT [Touvron et al., 2021b], Swin [Liu et al., 2021], MLP-Mixer [Tolstikhin et al., 2021], ResMLP [Touvron et al., 2021a] and VAN [Guo et al., 2022]. |
| Researcher Affiliation | Academia | Kumara Kahatapitiya and Michael S. Ryoo Stony Brook University {kkahatapitiy, mryoo}@cs.stonybrook.edu |
| Pseudocode | No | The paper describes its methods using diagrams and text but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at github.com/kkahatapitiya/SWAT. |
| Open Datasets | Yes | We use ImageNet-1K [Deng et al., 2009] and ADE20K [Zhou et al., 2019] as benchmarks to compare against common Transformer/Mixer/Conv architectures... |
| Dataset Splits | Yes | ImageNet-1K [Deng et al., 2009] is a commonly-used classification benchmark, with 1.2M training images and 50K validation images, annotated with 1000 categories. The ADE20K [Zhou et al., 2019] benchmark contains annotations for semantic segmentation across 150 categories. It comes with 25K annotated images in total, with 20K training, 2K validation and 3K testing. |
| Hardware Specification | Yes | FPS is measured on a single V100 GPU. |
| Software Dependencies | No | The paper mentions using the 'timm' library, 'mmsegmentation' framework, and 'PyTorch-like' implementations, but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | For all our models, we report Top-1 (%) accuracy on single-crop evaluation with complexity metrics such as Parameters and FLOPs. We train all our models for 300 epochs on inputs of 224x224 using the timm [Wightman, 2019] library. We use the original hyperparameters for all backbones, without further tuning. All models are trained with Mixed Precision. |
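
The Experiment Setup row reports 300-epoch training on 224x224 inputs with the timm library and mixed precision, but no training script is quoted. Below is a minimal, hedged sketch of what such a recipe could look like in PyTorch with timm; the placeholder backbone, batch size, optimizer, and learning rate are assumptions for illustration, not the paper's exact settings (the authors state they reuse each backbone's original hyperparameters).

```python
# Minimal sketch of the reported recipe (300 epochs, 224x224 single-crop inputs,
# mixed precision) using PyTorch + timm. Batch size, optimizer, and learning
# rate below are assumptions, not the paper's hyperparameters.
import torch
import timm
from timm.data import create_dataset, create_loader

device = torch.device("cuda")

# Placeholder backbone: the actual SWAT models are defined in
# github.com/kkahatapitiya/SWAT, not in stock timm.
model = timm.create_model("deit_tiny_patch16_224", num_classes=1000).to(device)

train_set = create_dataset("", root="path/to/imagenet-1k", split="train", is_training=True)
train_loader = create_loader(
    train_set,
    input_size=(3, 224, 224),   # 224x224 inputs, as reported
    batch_size=128,             # assumed; not stated in the excerpt
    is_training=True,
    use_prefetcher=False,       # keep tensors on CPU until the loop below
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)  # assumed
scaler = torch.cuda.amp.GradScaler()          # mixed-precision training, as reported
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(300):                      # 300 epochs, as reported
    model.train()
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():       # forward/backward in mixed precision
            loss = loss_fn(model(images), targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```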
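
Similarly, the Hardware Specification row states only that FPS is measured on a single V100 GPU. A common way to measure single-GPU inference throughput is sketched below; the batch size, warm-up count, and placeholder model are assumptions, since the paper does not describe its measurement script.

```python
# Rough sketch of single-GPU inference throughput (FPS) measurement. The paper
# reports FPS on one V100 but does not give its measurement script, so the
# batch size, warm-up count, and placeholder model here are assumptions.
import time
import torch
import timm

device = torch.device("cuda")
model = timm.create_model("deit_tiny_patch16_224", num_classes=1000).to(device).eval()
x = torch.randn(64, 3, 224, 224, device=device)  # assumed batch size

with torch.no_grad():
    for _ in range(10):               # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start, iters = time.time(), 50
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()

fps = iters * x.shape[0] / (time.time() - start)
print(f"throughput: {fps:.1f} images/s")
```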