R-MAE: Regions Meet Masked Autoencoders

Authors: Duy Kien Nguyen, Yanghao Li, Vaibhav Aggarwal, Martin R. Oswald, Alexander Kirillov, Cees G. M. Snoek, Xinlei Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section headings: "4 EXPERIMENTS", "4.1 EXPERIMENTAL SETUPS", "4.2 EXPERIMENTAL RESULTS", "Ablation studies."
Researcher Affiliation | Collaboration | FAIR, Meta AI; University of Amsterdam
Pseudocode | No | The paper describes its methods verbally and with diagrams, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | "The code is provided at https://github.com/facebookresearch/r-mae"
Open Datasets | Yes | "Deviating from prior practices (Bao et al., 2022; He et al., 2022), we develop RAE and R-MAE by pre-training on COCO train2017 (Lin et al., 2014)... For fairness, we also pre-train ViTs with R-MAE on ImageNet (Deng et al., 2009)..."
Dataset Splits | Yes | Same evidence as above: pre-training uses the COCO train2017 split (Lin et al., 2014) and ImageNet (Deng et al., 2009).
Hardware Specification | No | The paper discusses computational overheads in terms of FLOPs but does not specify any GPU models, CPU types, or other hardware used for running experiments.
Software Dependencies | No | The paper refers to frameworks like PyTorch and vision transformers but does not list specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "Our base learning rate is set to 1e-4... ViT-B (Dosovitskiy et al., 2020) is set as the pixel backbone, and a 1-block, 128-dimensional ViT is used for the neck, the region encoder and the region decoder... k=8 regions are randomly sampled per image... mask ratio of βR=0.75... The input image size is 1024×1024 with large-scale jitter between a scale range of [0.1, 2.0]. We finetune for 100 epochs with a batch size of 64."
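The Experiment Setup row can be collected into a single configuration object. This is a minimal sketch only: the key names and the dict layout are our own, not taken from the official facebookresearch/r-mae code, and the linear LR-scaling helper is a common MAE-style convention assumed here, not something the paper states.

```python
# Hedged sketch of the quoted R-MAE setup. All key names are hypothetical;
# only the values come from the paper's Experiment Setup quote.
rmae_config = {
    "base_lr": 1e-4,               # "Our base learning rate is set to 1e-4"
    "pixel_backbone": "ViT-B",     # pixel backbone (Dosovitskiy et al., 2020)
    "region_vit": {"blocks": 1, "dim": 128},  # neck, region encoder, region decoder
    "regions_per_image": 8,        # k = 8 regions randomly sampled per image
    "region_mask_ratio": 0.75,     # beta_R = 0.75
    "image_size": (1024, 1024),    # input image size
    "scale_jitter_range": (0.1, 2.0),  # large-scale jitter range
    "finetune_epochs": 100,
    "batch_size": 64,
}

def effective_lr(cfg, reference_batch_size=256):
    """Linear LR scaling rule often used with MAE-style recipes
    (assumption: lr = base_lr * batch_size / 256; not confirmed by the paper)."""
    return cfg["base_lr"] * cfg["batch_size"] / reference_batch_size

print(effective_lr(rmae_config))  # scaled LR under the assumed rule
```

Bundling the quoted values this way makes it easy to diff a reproduction attempt against the paper's stated recipe.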