CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations

Authors: Gengchen Mai, Ni Lao, Yutong He, Jiaming Song, Stefano Ermon

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that CSP can improve model performance on both the iNat2018 and fMoW datasets. In particular, on iNat2018, CSP significantly boosts model performance with 10-34% relative improvement across various labeled-training-data sampling ratios. We conduct experiments on geo-aware image classification tasks including fine-grained species recognition (Chu et al., 2019; Mac Aodha et al., 2019; Mai et al., 2020b; Yang et al., 2022) and remote sensing (RS) image classification (Christie et al., 2018; Ayush et al., 2021; Manas et al., 2021; Li et al., 2021a).
Researcher Affiliation | Collaboration | (1) Spatially Explicit Artificial Intelligence Lab, Department of Geography, University of Georgia, USA; (2) Department of Computer Science, Stanford University, USA; (3) School of Computing, University of Georgia, USA; (4) Google Inc., USA; (5) Machine Learning Department, Carnegie Mellon University, USA.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. It uses figures to illustrate architectures and processes, but these are not formatted as pseudocode.
Open Source Code | Yes | Code, data, and pre-trained models are available at https://gengchenmai.github.io/csp-website/. Our code and used datasets are available from https://gengchenmai.github.io/csp-website/.
Open Datasets | Yes | We use the iNat2018 dataset (Van Horn et al., 2018) as a representative dataset to study the effectiveness of CSP on species fine-grained recognition. (Footnote 4: https://github.com/visipedia/inat_comp/tree/master/2018) A similar procedure is carried out on the fMoW dataset (Christie et al., 2018), which has 62 different geospatial object classes and 363,570 location-image pairs. (Footnote 6: https://github.com/fMoW/dataset)
Dataset Splits | Yes | For each task, three datasets are used to pre-train, fine-tune, and evaluate our CSP models: X_train is a set of unlabeled location-image pairs we use for pre-training; X̄_train is a set of labeled location-image-class tuples we use for fine-tuning, where the size of X_train is much larger than that of X̄_train, i.e., |X_train| ≫ |X̄_train|; and X_val is a set of labeled location-image-class tuples we use for evaluation that cannot be seen during fine-tuning. The iNat2018 validation dataset is used for model evaluation to make our results comparable with previous work (Mac Aodha et al., 2019; Mai et al., 2020b; 2022d). Table 1 compares the Top1 accuracy of different training strategies on the iNat2018 validation dataset with different λ%. Table 5 compares the evaluation results (Top1 accuracy) among different models and training strategies on the fMoW val dataset after fine-tuning on λ% of the fMoW training samples, where λ% ∈ {5%, 10%, 20%, 100%}. (A minimal split-sampling sketch is given after this table.)
Hardware Specification | No | The paper states: "All models are implemented in PyTorch and trained on a Linux machine with 252GB memory and two GeoForce CUDA cores." This description is not specific enough: "GeoForce CUDA cores" names a brand and a generic component, not a specific GPU model (e.g., NVIDIA GeForce RTX 3080 or A100) or an exact number of CUDA cores, which would be required for reproducibility.
Software Dependencies | No | The paper mentions: "All models are implemented in PyTorch and trained on a Linux machine..." and references the "PyTorch Vision library" and "Huggingface timm library." However, it does not provide specific version numbers for any of these software components (PyTorch, PyTorch Vision, Huggingface timm, or the Linux distribution), which are necessary for reproducible software dependencies. (A version-logging sketch is given after this table.)
Experiment Setup | Yes | The major hyperparameters we tune include the fine-tuning learning rate η_super ∈ [0.01, 0.005, 0.002, 0.001, 0.0005, 0.00005], the grid's minimum scaling factor r_min ∈ [0.1, 0.01, 0.001, 0.0005, 0.0001], as well as the hyperparameters of the location encoder's multi-layer perceptron NN_ffn(·), such as its activation function σ_e ∈ [ReLU, LeakyReLU, GELU], the number of hidden layers h ∈ [1, 2, 3], the number of neurons k ∈ [256, 512, 1024], and the dropout rate in NN_ffn(·), D ∈ [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]. Based on the experiments, the best hyperparameter combination for few-shot learning on the iNat2018 dataset is η_super = 0.0005, r_min = 0.01, σ_e = LeakyReLU, h = 1, k = 512, dropout = 0.5. For CSP-MC-*, the best hyperparameter combination is η_unsuper = 0.0002, α1 = 1, α2 = 1, C = 1, τ0 = 1, τ1 = 1, and τ2 = 1. For CSP-NCE-*, the best combination is η_unsuper = 0.0002, β1 = 1, and β2 = 1. (The full search grid is written out as a sketch after this table.)
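
The split description in the Dataset Splits row can be made concrete with a small sketch. The helper below is hypothetical (not the authors' code): it draws a λ% subset of the labeled location-image-class tuples for fine-tuning, while the unlabeled pre-training pairs and the validation tuples are left untouched.

import random

# Hypothetical helper, not the authors' code: sample a lambda-percent subset of the
# labeled (location, image, class) tuples for fine-tuning; the unlabeled pre-training
# pairs and the validation tuples are not touched.
def sample_labeled_subset(labeled_train, lam_percent, seed=0):
    """labeled_train: list of (location, image, class) tuples; lam_percent: e.g. 5, 10, 20, 100."""
    rng = random.Random(seed)
    n = max(1, round(len(labeled_train) * lam_percent / 100.0))
    return rng.sample(labeled_train, n)

# Example: finetune_set = sample_labeled_subset(labeled_train, lam_percent=10)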
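
Because the Software Dependencies row flags missing version numbers, a reproduction can at least record the stack it actually runs on. The snippet below is a sketch that assumes torch, torchvision, and timm (the libraries the paper mentions) are installed; it simply logs their versions alongside any results.

# Sketch: log the versions of the libraries the paper mentions but does not pin.
import torch
import torchvision
import timm

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("timm:", timm.__version__)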
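
Finally, the search space in the Experiment Setup row can be written out as a plain grid. The sketch below uses illustrative names (lr_super, r_min, hidden_dim, ...) rather than identifiers from the released code, and the training call is a placeholder.

from itertools import product

# The hyperparameter grid reported above, written as a Python dict; names are
# illustrative, not taken from the authors' released code.
search_space = {
    "lr_super": [0.01, 0.005, 0.002, 0.001, 0.0005, 0.00005],
    "r_min": [0.1, 0.01, 0.001, 0.0005, 0.0001],
    "activation": ["ReLU", "LeakyReLU", "GELU"],
    "hidden_layers": [1, 2, 3],
    "hidden_dim": [256, 512, 1024],
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
}

# Exhaustive grid search; the best reported combination for few-shot learning on
# iNat2018 is lr_super=0.0005, r_min=0.01, activation=LeakyReLU, hidden_layers=1,
# hidden_dim=512, dropout=0.5.
for values in product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    # train_and_evaluate(config)  # placeholder for the actual training/evaluation call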