No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations

Authors: Walter Simoncini, Andrei Bursuc, Spyridon Gidaris, Yuki Asano

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our FUNGI features using the task of k-NN classification. To show the generalizability of our method, we evaluate our features across ViT backbones (Dosovitskiy et al., 2021) with varying model sizes and pretraining strategies, including both supervised and self-supervised methods. We conduct our experiments on 11 diverse downstream datasets, described in Appendix D. Unless otherwise specified, we report the average accuracy across these datasets.
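To make the evaluation protocol concrete, below is a minimal sketch of a k-NN probe over precomputed frozen features, using scikit-learn (which the paper cites). The feature shapes, class count, and k=20 are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of a k-NN evaluation over precomputed (frozen) features.
# All arrays are random placeholders standing in for extracted features.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 768)).astype(np.float32)  # e.g. ViT embeddings
train_labels = rng.integers(0, 10, size=1000)
test_feats = rng.normal(size=(200, 768)).astype(np.float32)
test_labels = rng.integers(0, 10, size=200)

# L2-normalize so Euclidean nearest neighbors match cosine-similarity neighbors.
train_feats /= np.linalg.norm(train_feats, axis=1, keepdims=True)
test_feats /= np.linalg.norm(test_feats, axis=1, keepdims=True)

knn = KNeighborsClassifier(n_neighbors=20)  # k=20 is an assumed value
knn.fit(train_feats, train_labels)
print(f"k-NN top-1 accuracy: {knn.score(test_feats, test_labels):.4f}")
```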
Researcher Affiliation | Collaboration | Walter Simoncini¹, Spyros Gidaris², Andrei Bursuc², Yuki M. Asano¹; ¹QUVA Lab, University of Amsterdam; ²valeo.ai, Paris, France
Pseudocode | Yes | Algorithm 1 provides PyTorch-style pseudocode for the computation of L_KL, the gradient extraction, and the computation of FUNGI features (without PCA).
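Algorithm 1 itself is given in the paper; below is a hedged PyTorch-style sketch of the general recipe it describes: backpropagate a self-supervised loss through a small projection head on top of a frozen embedding, then use the flattened gradient as an extra feature. The uniform-target KL loss and the head dimensions here are illustrative assumptions, not the paper's exact L_KL.

```python
# Hedged sketch of gradient-feature extraction in the spirit of Algorithm 1.
# The KL-to-uniform loss and head shape are stand-ins, not the paper's exact loss.
import torch
import torch.nn.functional as F

embed_dim, proj_dim = 768, 256
head = torch.nn.Linear(embed_dim, proj_dim)  # projection head; frozen backbone not shown

def fungi_feature(embedding: torch.Tensor) -> torch.Tensor:
    """Concatenate the embedding with the gradient of a KL loss w.r.t. the head."""
    head.zero_grad()
    log_probs = F.log_softmax(head(embedding), dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / proj_dim)
    # KL divergence against a uniform target; illustrative stand-in for L_KL.
    loss = F.kl_div(log_probs, uniform, reduction="batchmean")
    loss.backward()
    grad = head.weight.grad.flatten()  # gradient features (PCA step omitted)
    return torch.cat([embedding.detach().flatten(), grad])

feature = fungi_feature(torch.randn(1, embed_dim))
print(feature.shape)  # embedding dim + flattened head-gradient dim
```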
Open Source Code | Yes | Code is available at https://github.com/WalterSimoncini/fungivision.
Open Datasets | Yes | We investigate the performance of our gradient-enhanced features on 11 image classification datasets, namely CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), Oxford Flowers 102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), ImageNet-1K (Russakovsky et al., 2015), FGVC Aircraft (Maji et al., 2013), CUB-200-2011 (Wah et al., 2011), Oxford-IIIT Pets (Parkhi et al., 2012), Stanford Cars (Krause et al., 2013), DTD Textures (Cimpoi et al., 2014), and EuroSAT (Helber et al., 2019); 5 text classification datasets: TREC (Li & Roth, 2002) in its coarse version, banking-77 (Casanueva et al., 2020), Stanford Sentiment Treebank (SST) (Socher et al., 2013) in its fine-grained version, AG News (Zhang et al., 2015; Gulli, 2005), and TweetEval (emoji) (Barbieri et al., 2018, 2020); and 2 audio classification datasets: ESC-50 (Piczak, 2015), an environmental sound classification dataset, and Speech Commands V2 (Warden, 2018).
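Most of the listed vision datasets ship with torchvision wrappers, so loading them is straightforward; a short sketch (the download root, resize size, and choice of datasets are illustrative):

```python
# Sketch: loading two of the listed vision datasets via torchvision wrappers.
from torchvision import datasets, transforms

to_tensor = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

cifar10_train = datasets.CIFAR10(root="data", train=True, download=True, transform=to_tensor)
cifar10_test = datasets.CIFAR10(root="data", train=False, download=True, transform=to_tensor)
flowers_test = datasets.Flowers102(root="data", split="test", download=True, transform=to_tensor)

print(len(cifar10_train), len(cifar10_test), len(flowers_test))
```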
Dataset Splits | Yes | We use the default splits defined by torchvision or the dataset authors where possible. As EuroSAT does not explicitly define a test split, we use an 80/20 stratified split, as indicated by the dataset paper. We always report metrics on the test splits, with the exception of ImageNet, for which we use the validation split.
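An 80/20 stratified split of this kind is easy to reproduce with scikit-learn; in this sketch the placeholder labels and the fixed seed are assumptions (the report does not state a seed):

```python
# Sketch of an 80/20 stratified split, as described for EuroSAT.
import numpy as np
from sklearn.model_selection import train_test_split

labels = np.arange(27000) % 10  # placeholder for EuroSAT's 10 class labels
indices = np.arange(len(labels))
train_idx, test_idx = train_test_split(
    indices,
    test_size=0.2,      # 20% held out for testing
    stratify=labels,    # preserve per-class proportions
    random_state=42,    # assumed seed; not stated in the paper
)
print(len(train_idx), len(test_idx))  # 21600 / 5400
```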
Hardware Specification | Yes | The gradient features were extracted on a machine with a single NVIDIA A100 GPU (40GB VRAM). The text and audio classification experiments require around 3 GPU hours per backbone, for a total of 9 hours. The extracted gradient features were reused for the linear probing and clustering experiments; the former required 168 hours on a machine with a single AMD EPYC 7H12 CPU, and the latter 18 hours on a machine with a single NVIDIA A100 GPU (40GB VRAM).
Software Dependencies | No | The paper mentions software such as scikit-learn (Pedregosa et al., 2011), faiss (Johnson et al., 2019; Douze et al., 2024), and the Cyanure library (Mairal, 2019), but does not provide version numbers for these components, which would be needed for exact reproducibility.
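Since versions are not pinned, a reproduction should record the versions actually installed; a minimal sketch (assumes faiss-cpu or faiss-gpu is installed alongside the others):

```python
# Record the library versions used in a reproduction run.
import sklearn
import faiss
import torch

print("scikit-learn:", sklearn.__version__)
print("faiss:", faiss.__version__)
print("torch:", torch.__version__)
```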
Experiment Setup | Yes | The parameters used for each loss are shown in Table 22. This set of parameters is used consistently across backbones and datasets. While L_KL and L_DINO are robust to the choice of hyperparameters, L_SimCLR is particularly sensitive to the number of positive views, as shown in Figure 12, where performance increases logarithmically as more positive views are used, at the cost of gradient extraction speed.
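Because the view count is the setting the setup is most sensitive to, a sketch of generating n positive views for a SimCLR-style loss may help; the augmentation pipeline and n=8 are illustrative assumptions, not the paper's Table 22 values:

```python
# Sketch: generating n positive (augmented) views of one image for a
# SimCLR-style loss. More views reportedly improve accuracy roughly
# logarithmically, at the cost of slower gradient extraction.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
])

def make_views(image: torch.Tensor, n_views: int = 8) -> torch.Tensor:
    """Stack n independently augmented views of a single image tensor."""
    return torch.stack([augment(image) for _ in range(n_views)])

views = make_views(torch.rand(3, 256, 256))
print(views.shape)  # torch.Size([8, 3, 224, 224])
```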