CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders

Authors: Anthony Fuller, Koreen Millard, James Green

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | CROMA outperforms the current SoTA multispectral model, evaluated on four classification benchmarks: finetuning (avg. 1.8%), linear (avg. 2.4%) and nonlinear (avg. 1.4%) probing, kNN classification (avg. 3.5%), and K-means clustering (avg. 8.4%); and three segmentation benchmarks (avg. 6.4%).
Researcher Affiliation | Academia | (1) Department of Systems and Computer Engineering, (2) Department of Geography and Environmental Studies, Carleton University, Ottawa, Canada
Pseudocode | No | The paper describes the model architecture and objectives in text and diagrams (Figure 1) but does not provide pseudocode or a clearly labeled algorithm block (a hedged sketch of the stated objectives is given after this table).
Open Source Code | Yes | Code and pretrained models: https://github.com/antofuller/CROMA
Open Datasets | Yes | We pretrain CROMA models on the SSL4EO dataset [70], a large geographically and seasonally diverse unlabeled dataset. ... The multi-label BigEarthNet dataset [76]... The fMoW-Sentinel dataset [26]... The EuroSAT dataset [77]... The Canadian Cropland dataset [78]... The DFC2020 dataset [87]... The Dynamic World dataset [88]... The MARIDA dataset [89]
Dataset Splits | Yes | The multi-label BigEarthNet dataset [76] (35,420 train samples and 118,065 validation samples); this is 10% of the complete BigEarthNet training set that is now used by default [25, 26] to reduce the costs of finetuning and is better suited for a remote sensing benchmark [22].
Hardware Specification | Yes | We perform all pretraining experiments on an NVIDIA DGX server (8× A100-80GB), including ablations.
Software Dependencies | No | The paper mentions using bfloat16 precision and the AdamW optimizer but does not specify versions for software dependencies like Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We use an NVIDIA DGX server (8× A100-80GB), the maximum batch size that can fit into 640 GB of VRAM (7,200 for our default ViT-B), bfloat16 precision, a base learning rate of 4e-6, warmup for 5% of the total epochs, and cooldown via a cosine decay schedule. We use the same normalization procedure as SatMAE [26]. For data augmentation, we randomly crop 60-180 pixel squares from the original 264×264 pixels and resize the crops to 120×120 pixels (our default image size). We also perform vertical and horizontal flipping, 90-degree rotations, and mixup=0.3.
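
To make the Experiment Setup row concrete, below is a minimal PyTorch sketch of the quoted recipe: linear warmup for 5% of training followed by cosine decay from a base learning rate of 4e-6, random 60-180 pixel square crops resized to 120×120, flips, 90-degree rotations, and mixup with parameter 0.3. Everything beyond those quoted numbers is an assumption: the helper names, the per-step (rather than per-epoch) schedule, treating the base learning rate as the final learning rate (the paper may scale it by batch size), and reading mixup=0.3 as the Beta(alpha, alpha) parameter.

```python
# Hedged sketch of the quoted pretraining recipe; helper names and schedule granularity
# are assumptions, not taken from the CROMA code release.
import math
import random
import torch
import torch.nn.functional as F


def lr_at_step(step, total_steps, base_lr=4e-6, warmup_frac=0.05):
    """Linear warmup for 5% of steps, then cosine decay to zero (quoted schedule)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def random_crop_resize(img, crop_range=(60, 180), out_size=120):
    """Crop a random 60-180 px square from a (C, 264, 264) image, resize to 120x120."""
    _, h, w = img.shape
    side = random.randint(*crop_range)
    top, left = random.randint(0, h - side), random.randint(0, w - side)
    crop = img[:, top:top + side, left:left + side].unsqueeze(0)
    return F.interpolate(crop, size=(out_size, out_size), mode="bilinear",
                         align_corners=False).squeeze(0)


def random_flip_rotate(img):
    """Vertical/horizontal flips and random 90-degree rotations (quoted augmentations)."""
    if random.random() < 0.5:
        img = torch.flip(img, dims=[-1])   # horizontal flip
    if random.random() < 0.5:
        img = torch.flip(img, dims=[-2])   # vertical flip
    return torch.rot90(img, random.randint(0, 3), dims=[-2, -1])


def mixup(x, alpha=0.3):
    """Mixup across a batch; assumes the quoted mixup=0.3 is the Beta(alpha, alpha) parameter."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], perm, lam


if __name__ == "__main__":
    # Toy usage on a random 13-band "Sentinel-2-like" patch; real inputs come from SSL4EO.
    patch = torch.rand(13, 264, 264)
    view = random_flip_rotate(random_crop_resize(patch))
    print(view.shape, lr_at_step(step=100, total_steps=10_000))
```

The quoted AdamW optimizer and bfloat16 precision would plug into this sketch as `torch.optim.AdamW(model.parameters(), lr=...)` with `torch.autocast(device_type="cuda", dtype=torch.bfloat16)` around the forward pass; both calls exist in recent PyTorch releases, but the exact versions used by the authors are not reported (see the Software Dependencies row).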
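
Since the Pseudocode row notes that the paper gives no algorithm block, the following is a heavily hedged sketch of the two objectives named in the title: a radar-optical contrastive loss and a masked-autoencoder reconstruction loss. The encoder and decoder call signatures, the temperature, the loss weighting `lam`, and the reconstruction target are illustrative assumptions only; the paper (Figure 1) and the released repository are the authoritative references for the actual CROMA design.

```python
# Hedged sketch of a contrastive + masked-reconstruction pretraining step.
# Module signatures, temperature, and loss weighting are placeholders, not CROMA's exact design.
import torch
import torch.nn.functional as F


def info_nce(z_radar, z_optical, temperature=0.07):
    """Symmetric InfoNCE between paired radar and optical embeddings of shape (B, D)."""
    z_r = F.normalize(z_radar, dim=-1)
    z_o = F.normalize(z_optical, dim=-1)
    logits = z_r @ z_o.t() / temperature                    # (B, B) cross-modal similarities
    targets = torch.arange(z_r.size(0), device=z_r.device)  # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def pretrain_step(radar_encoder, optical_encoder, decoder, radar, optical, mask, lam=1.0):
    """One illustrative step: align the two modalities and reconstruct masked content."""
    z_radar = radar_encoder(radar, mask)        # pooled radar representation, (B, D)
    z_optical = optical_encoder(optical, mask)  # pooled optical representation, (B, D)
    contrastive = info_nce(z_radar, z_optical)

    prediction = decoder(z_radar, z_optical)     # predicted masked pixels/patches
    target = torch.cat([radar, optical], dim=1)  # assumed joint reconstruction target
    reconstruction = F.mse_loss(prediction, target)

    return contrastive + lam * reconstruction
```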