T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

Authors: Pratyush Maini, Sachin Goyal, Zachary Chase Lipton, J Zico Kolter, Aditi Raghunathan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, T-MARS is the top-ranked approach on ImageNet at the medium scale of DataComp (a data filtering benchmark), and outperforms CLIP filtering by a margin of 6.5% on ImageNet and 4.7% on VTAB.
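The zero-shot accuracies behind these numbers are computed in the standard CLIP fashion: each class name is embedded via a text prompt, and an image is assigned the class whose text embedding is most similar. A minimal sketch, assuming hypothetical `encode_image` / `encode_text` functions that return L2-normalized embeddings; this illustrates the evaluation protocol, not the authors' evaluation code:

```python
# Minimal sketch of CLIP-style zero-shot classification, as used for the
# ImageNet / VTAB evaluations. `encode_image` and `encode_text` are
# hypothetical stand-ins for a trained model's encoders; both are assumed
# to return L2-normalized vectors, so a dot product is cosine similarity.
import numpy as np

def zero_shot_accuracy(images, labels, class_names, encode_image, encode_text):
    # Embed one prompt per class once, up front.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = np.stack([encode_text(p) for p in prompts])  # (C, d)
    correct = 0
    for img, label in zip(images, labels):
        img_emb = encode_image(img)       # (d,)
        sims = text_emb @ img_emb         # similarity to each class prompt
        correct += int(np.argmax(sims) == label)
    return correct / len(images)
```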
Researcher Affiliation | Collaboration | Pratyush Maini, Sachin Goyal, Zachary C. Lipton, J. Zico Kolter, Aditi Raghunathan; Carnegie Mellon University and Bosch Center for AI. {pratyushmaini,sachingoyal,zlipton,zkolter,raditi}@cmu.edu
Pseudocode | Yes | The paper provides Algorithm 1:
Algorithm 1: T-MARS
Input: dataset S = {(i_k, t_k)}_{k=1}^{n}, score function ℓ, image masking function m
Output: filtered pool S̄
// Step 1: Text-Masking
for k = 0 ... n−1 do
    ī_k = m(i_k)
end for
// Step 2: Re-Scoring
for k = 0 ... n−1 do
    s_k = ℓ(ī_k, t_k)
end for
α = Median({s_k}_{k=1}^{n})
return S̄ = {(i_k, t_k) | s_k ≥ α}
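In code, the algorithm masks detected text regions in every image, re-scores each masked image against its caption, and keeps the pairs scoring at or above the median. A minimal sketch of the filtering loop, assuming hypothetical `mask_text` (e.g., an OCR-based masking step) and `clip_score` (image-text similarity) callables; an illustration, not the authors' implementation:

```python
# Minimal sketch of the T-MARS filtering loop (Algorithm 1).
# `mask_text` and `clip_score` are hypothetical helper callables.
from statistics import median

def t_mars_filter(dataset, mask_text, clip_score):
    """dataset: list of (image, caption) pairs; returns the filtered pool."""
    # Step 1: mask detected text regions in every image
    masked = [mask_text(img) for img, _ in dataset]
    # Step 2: re-score each pair using the masked image
    scores = [clip_score(m_img, cap) for m_img, (_, cap) in zip(masked, dataset)]
    # Keep pairs whose masked-image score is at least the median
    alpha = median(scores)
    return [pair for pair, s in zip(dataset, scores) if s >= alpha]
```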
Open Source Code | No | The paper thanks the DataComp and OpenCLIP teams for the codebases used for training, but it does not provide an explicit statement or link for the open-source code of its own method, T-MARS.
Open Datasets | Yes | We first experiment on six different data pools ranging from 2M to 64M samples chosen from the LAION-400M dataset. Note that the compute budget (total training samples seen, i.e., epochs) is kept the same as the pool size. For example, for a 32M pool size, the total samples which can be seen during training is kept at 32M (i.e., 1 epoch over the whole dataset). In cases where filtering methods retain a smaller subset (say 16M samples) of the data pool, they get the advantage of running more iterations (2 epochs over the 16M subset, i.e., 32M total samples seen) over the chosen subset. Finally, we also experiment on the 12.8M (small scale) and 128M (medium scale) data pools of the recently released DataComp. We use the implementation of the DataComp library to standardize the training process. We train both ResNet-50 and ViT-B-32 models with a batch size of 1024, using a cosine learning rate schedule with 200 steps of warmup at 5e-4. We use AdamW as the optimizer for training. All the experiments were performed on NVIDIA A6000 GPUs. (Section 5.2, Evaluation Datasets:) We extensively evaluate zero-shot accuracies on a suite of benchmarks considered in prior work (Radford et al., 2021; Wortsman et al., 2021): (a) ImageNet: a 1000-class image classification challenge (Russakovsky et al., 2015); (b) ImageNet-OOD: six associated ImageNet distribution shifts, ImageNet-V2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2020), ImageNet-A (Hendrycks et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-O (Hendrycks et al., 2019), and ObjectNet (Barbu et al., 2019); (c) VTAB: 12 datasets from the Visual Task Adaptation Benchmark (Zhai et al., 2020), including Caltech101, CIFAR100, DTD, Flowers102, Pets, SVHN, Resisc45, EuroSAT, PatchCamelyon, CLEVR Counts, CLEVR Distance, KITTI, and SUN397; and (d) Retrieval: three retrieval tasks, MSCOCO (Chen et al., 2015), Flickr (Young et al., 2014), and WinoGAViL (Bitton et al., 2022).
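The fixed-compute protocol quoted above reduces to simple arithmetic: the number of epochs over a filtered subset is the pool-size budget divided by the subset size. A small illustrative helper (hypothetical, restating only the arithmetic from the quote):

```python
# Illustrative only: epochs seen under DataComp-style fixed compute,
# where the total-samples budget equals the original pool size.
def epochs_over_subset(pool_size: int, subset_size: int) -> float:
    """Total samples seen is fixed at pool_size, so a smaller
    filtered subset is simply revisited more often."""
    return pool_size / subset_size

assert epochs_over_subset(32_000_000, 32_000_000) == 1.0  # no filtering
assert epochs_over_subset(32_000_000, 16_000_000) == 2.0  # half retained
```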
Dataset Splits | No | The paper describes training on subsets of the LAION dataset and evaluating on external benchmark datasets (e.g., ImageNet, VTAB, DataComp, Imagenette). While it mentions using Conceptual Captions (CC3M) for fine-tuning a validation model, it does not specify explicit train/validation/test splits of its primary LAION-based data pools for the main experiments.
Hardware Specification | Yes | All the experiments were performed on NVIDIA A6000 GPUs.
Software Dependencies | No | The paper mentions using the 'DataComp library', 'OpenCLIP', and the 'MMOCR library', but it does not specify version numbers for these software components or any other key dependencies.
Experiment Setup | Yes | We train both ResNet-50 and ViT-B-32 models with a batch size of 1024, using a cosine learning rate schedule with 200 steps of warmup at 5e-4. We use AdamW as the optimizer for training. We train a randomly initialized ViT-B-32 vision encoder with a pre-trained RoBERTa text encoder for 120 steps of warmup followed by a cosine schedule with a maximum learning rate of 1e-3. The number of training steps is the same across all training runs (fixed at 600 steps at a batch size of 1024).
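A minimal PyTorch sketch of the quoted optimizer recipe follows. The peak learning rate and warmup steps come from the quote; the exact schedule shape (linear warmup into a cosine decay to zero) is a common default and an assumption here, not the authors' code:

```python
# Sketch of the quoted recipe: AdamW with linear warmup into cosine decay.
# `model` is any torch.nn.Module; numbers other than the defaults below
# are the caller's choice.
import math
import torch

def make_optimizer_and_scheduler(model, total_steps, peak_lr=5e-4, warmup_steps=200):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warmup to peak_lr
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

For the 600-step proxy runs quoted above, one would pass total_steps=600 (and warmup_steps=120 with peak_lr=1e-3 for the RoBERTa-text-encoder variant).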