Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Understanding and Enhancing Mask-Based Pretraining towards Universal Representations

Authors: Mingze Dong, Leda Wang, Yuval Kluger

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The theoretical framework and its implications have been validated across diverse neural architectures (including MLPs, CNNs, and Transformers) applied to both vision and language tasks. Guided by our theory, we propose an embarrassingly simple yet overlooked pretraining scheme named Randomly Random Mask Auto Encoding (R2MAE), which enforces capturing multi-scale features from data and is able to outperform optimal fixed mask ratio settings in our linear model framework. We implement R2MAE in vision, language, DNA sequence, and single-cell models, where it consistently outperforms standard and more complicated masking schemes, leading to improvements for state-of-the-art models.
Researcher Affiliation Academia Mingze Dong Yale University EMAIL Leda Wang Yale University EMAIL Yuval Kluger Yale University EMAIL
Pseudocode No The paper text describes the R2MAE scheme as: "Expose the model to data corrupted with a uniformly sampled masking ratio p ∼ U(p_min, p_max)." This is a textual description of the method rather than a structured pseudocode or algorithm block.
Open Source Code Yes Our code is available at this URL.
Open Datasets Yes We used the standard MNIST dataset that consists of 60,000 training and 10,000 test grayscale images of handwritten digits at 28 × 28 pixels. The CelebA dataset contains over 200,000 celebrity face images with 40 attribute annotations. We used the ViT-base MAE model and the ImageNet-1K training split as pretraining data, following the MAE codebase [6]. We used the Hugging Face RoBERTa-medium and RoBERTa-base models, and the 10B token subset of FineWeb (sample-10BT, downloaded from Hugging Face) [75] as the training set. We employed the Human Lung Cell Atlas dataset [63] and human brain MTG SEA-AD dataset [64].
Dataset Splits Yes We used the standard MNIST dataset that consists of 60,000 training and 10,000 test grayscale images. The CelebA dataset ... using the official training/validation/test split. For the HLCA dataset, we further filtered out cells that have fewer than 20 of these HVGs. After preprocessing, these datasets have 2,161,082 and 1,378,211 cells respectively. For single-cell gene expression models: 90% of data were selected as the training set and the remaining 10% was set as the validation set.
Hardware Specification Yes All experiments used PyTorch on an NVIDIA A6000 GPU with fixed random seeds. All experiments used an NVIDIA 6000 GPU with fixed random seeds. Each experiment was performed on one NVIDIA H100 GPU with the same fixed random seed. All experiments used PyTorch on 4 NVIDIA 6000 Ada GPUs with fixed random seeds. All experiments used PyTorch on an NVIDIA 6000 GPU with fixed random seeds.
Software Dependencies No The paper mentions using PyTorch, the Adam and AdamW optimizers, and Scanpy. However, specific version numbers for these software components are not provided in the paper text.
Experiment Setup Yes For MNIST: Models were trained for 15 epochs using Adam optimizer (learning rate 0.003, batch size 128). For CelebA: Each model was trained for 10 epochs using Adam optimizer (learning rate 0.001, batch size 256). For ViT MAE: We trained all models for 150 epochs with 10 warmup epochs. For RoBERTa: We used AdamW optimizer, a max sequence length of 128, an effective batch size of 2048, a weight decay of 0.01, a warmup ratio of 0.03, a learning rate of 7e-4/3e-4 for the RoBERTa-medium/base models, and default linear learning rate decay. Fine-tuning was performed on the GLUE datasets (MNLI, QQP, SST-2, QNLI) for 5 epochs with a learning rate of 2e-5 and a batch size of 32. For DNA sequence models: Models were trained for 30,000 steps using the defaults in [16] with AdamW optimizer, learning rate 1e-4, and effective batch size 2048. For Single-cell gene expression models: Models were trained for 50 epochs using Adam optimizer (learning rate 1e-3, weight decay 1e-4, batch size 400).
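
Illustrative sketch (not from the paper): the Pseudocode row notes that R2MAE is described only in text, as exposing the model to data corrupted with a masking ratio p ∼ U(p_min, p_max). The Python below is one possible reading of that sentence, not the authors' implementation. The function names, the choice to draw a fresh ratio per sample (rather than per batch), and the choice to mask exactly round(p * n) positions are all assumptions made for illustration.

```python
import random


def sample_mask_ratio(p_min, p_max, rng):
    """Draw one masking ratio p ~ U(p_min, p_max)."""
    if not 0.0 <= p_min <= p_max <= 1.0:
        raise ValueError("expected 0 <= p_min <= p_max <= 1")
    return rng.uniform(p_min, p_max)


def random_mask(num_tokens, ratio, rng):
    """Return a list of booleans; True marks a masked position.

    Assumption: exactly round(ratio * num_tokens) positions are masked,
    chosen uniformly at random without replacement.
    """
    num_masked = round(ratio * num_tokens)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def r2mae_masks(batch_size, num_tokens, p_min, p_max, seed=0):
    """Build one mask per sample, each with its own random ratio.

    Assumption: the ratio is re-drawn for every sample; the source
    text does not say whether it is drawn per sample or per batch.
    """
    rng = random.Random(seed)  # fixed seed for repeatable masks
    masks = []
    for _ in range(batch_size):
        ratio = sample_mask_ratio(p_min, p_max, rng)
        masks.append(random_mask(num_tokens, ratio, rng))
    return masks
```

For example, r2mae_masks(8, 100, 0.1, 0.9) returns eight masks of length 100 whose masked counts vary between 10 and 90, in contrast to a fixed-ratio scheme where every mask would hide the same number of positions.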