Masked Completion via Structured Diffusion with White-Box Transformers

Authors: Druv Pai, Sam Buchanan, Ziyang Wu, Yaodong Yu, Yi Ma

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical evaluations confirm our analytical insights. CRATE-MAE demonstrates highly promising performance on large-scale imagery datasets while using only 30% of the parameters compared to the standard masked autoencoder with the same model configuration. In this section, we conduct experiments to evaluate CRATE-MAE on real-world datasets and both supervised and unsupervised tasks.
Researcher Affiliation | Academia | Druv Pai (UC Berkeley), Ziyang Wu (UC Berkeley), Sam Buchanan (TTIC), Yaodong Yu (UC Berkeley), Yi Ma (UC Berkeley & HKU)
Pseudocode | Yes | Appendix B.2, "PyTorch-like pseudocode" (see the illustrative sketch after this table)
Open Source Code | Yes | Code is available on GitHub.
Open Datasets | Yes | We consider ImageNet-1K (Deng et al., 2009) as the main experimental setting for our architecture. We fine-tune and linear probe our pre-trained CRATE-MAE on the following target datasets: CIFAR10/CIFAR100 (Krizhevsky et al., 2009), Oxford Flowers-102 (Nilsback & Zisserman, 2008), Oxford-IIIT-Pets (Parkhi et al., 2012).
Dataset Splits | No | The paper refers to training and validation sets (Table 4) and to "standard practice" for MAE training, but it does not state split percentages or sample counts for the training, validation, and test sets.
Hardware Specification | No | The paper does not report the hardware used for its experiments, such as GPU models (e.g., NVIDIA A100, RTX 2080 Ti) or CPU specifications.
Software Dependencies | No | The paper mentions the AdamW optimizer and scikit-learn but does not give version numbers for these or for other dependencies such as PyTorch.
Experiment Setup | Yes | We configure the learning rate as 3 × 10⁻⁵, weight decay as 0.1, and batch size as 4,096. We configure the learning rate as 5 × 10⁻⁵, weight decay as 0.01, and batch size as 256.