Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning
Authors: Kaiyou Song, Shan Zhang, Tong Wang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on a broad range of downstream tasks, including image classification, object detection, and instance/semantic segmentation, to evaluate the performance of SemAIM. The results demonstrate that SemAIM achieves state-of-the-art performance compared with other self-supervised methods. |
| Researcher Affiliation | Industry | Megvii Technology {songkaiyou, zhangshan, wangtong}@megvii.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. The method is described in text and through diagrams. |
| Open Source Code | Yes | Code is available at https://github.com/skyoux/SemAIM. |
| Open Datasets | Yes | All self-supervised pre-training is performed on the ImageNet (Russakovsky et al. 2015) training set with a resolution of 224×224. We conduct end-to-end fine-tuning on COCO (Lin et al. 2014) with 1024×1024 resolution. We perform end-to-end fine-tuning on ADE20K (Zhou et al. 2017). |
| Dataset Splits | Yes | For fine-tuning, 100 epochs with a 1024 batch size are performed following common practices (He et al. 2022; Bao, Dong, and Wei 2022) by default. We report top-1 accuracy at a single 224×224 resolution on the ImageNet (Russakovsky et al. 2015) validation set. |
| Hardware Specification | No | The paper does not describe the hardware used to run its experiments, such as GPU models, CPU types, or cloud computing instances. |
| Software Dependencies | No | The paper mentions software like "detectron2 (Wu et al. 2019)", "ViTDet (Li et al. 2022b)", and "mmsegmentation (Contributors 2020)", but it does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | All self-supervised pre-training is performed on the ImageNet (Russakovsky et al. 2015) training set with a resolution of 224×224. In default settings, we take ViT-B/16 (Dosovitskiy et al. 2020) as the default backbone and pre-train models with a 2048 batch size for 200 epochs. For fine-tuning, 100 epochs with a 1024 batch size are performed. |
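The experiment-setup row above can be condensed into a small config sketch. This is a minimal illustration of the hyperparameters the paper reports, not code from the SemAIM repository; the class and field names are hypothetical and chosen only to mirror the quoted settings.

```python
from dataclasses import dataclass

# Hypothetical config objects summarizing the settings quoted in the table
# above; names are illustrative and do not come from the SemAIM codebase.

@dataclass
class PretrainConfig:
    backbone: str = "ViT-B/16"    # default backbone (Dosovitskiy et al. 2020)
    dataset: str = "ImageNet"     # training set (Russakovsky et al. 2015)
    resolution: int = 224         # 224x224 input images
    batch_size: int = 2048
    epochs: int = 200

@dataclass
class FinetuneConfig:
    dataset: str = "ImageNet"     # top-1 accuracy reported on the validation set
    resolution: int = 224
    batch_size: int = 1024
    epochs: int = 100

if __name__ == "__main__":
    # Print the documented defaults for a quick sanity check.
    print(PretrainConfig())
    print(FinetuneConfig())
```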