T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
Authors: Pratyush Maini, Sachin Goyal, Zachary Chase Lipton, J Zico Kolter, Aditi Raghunathan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, T-MARS is the top-ranked approach on ImageNet at the medium scale of DataComp (a data filtering benchmark), and outperforms CLIP filtering by a margin of 6.5% on ImageNet and 4.7% on VTAB. The paper evaluates zero-shot accuracies on a suite of benchmarks from prior work (Radford et al., 2021; Wortsman et al., 2021): ImageNet, six ImageNet distribution shifts, the VTAB datasets, and three retrieval tasks (the full list is quoted under Open Datasets below). |
| Researcher Affiliation | Collaboration | Pratyush Maini, Sachin Goyal, Zachary C. Lipton, J. Zico Kolter, Aditi Raghunathan (Carnegie Mellon University; Bosch Center for AI). {pratyushmaini,sachingoyal,zlipton,zkolter,raditi}@cmu.edu |
| Pseudocode | Yes | Algorithm 1 (T-MARS). Input: dataset S = {(i_k, t_k)}_{k=1..n}, score function ℓ, image masking function m. Output: filtered pool S'. Step 1 (Text-Masking): for k = 0 ... n-1, set i'_k = m(i_k). Step 2 (Re-Scoring): for k = 0 ... n-1, set s_k = ℓ(i'_k, t_k). Compute α = Median({s_k}_{k=1..n}) and return S' = {(i_k, t_k) | s_k ≥ α}. A runnable Python sketch of this algorithm is given after the table. |
| Open Source Code | No | The paper thanks the DataComp and OpenCLIP teams for their codebases used for training, but it does not provide an explicit statement or link for the open-source code of its own method, T-MARS. |
| Open Datasets | Yes | We first experiment on six different data pools ranging from 2M to 64M samples chosen from the LAION-400M dataset. Note that the compute budget (total training samples seen, i.e., epochs × pool size) is kept the same as the pool size. For example, for a 32M pool size, the total samples which can be seen during training is kept at 32M (i.e. 1 epoch over the whole dataset). In cases where filtering methods retain a smaller subset (say 16M samples) of the data pool, they get the advantage of running more iterations (2 epochs over the 16M subset, i.e. a total of 32M samples seen) over the chosen subset. Finally, we also experiment on the 12.8M (small scale) and 128M (medium scale) data pools of the recently released DataComp. We use the implementation of the DataComp library to standardize the training process. We train both ResNet-50 and ViT-B-32 models with a batch size of 1024, using a cosine learning rate schedule with 200 steps of warmup at 5e-4. We use AdamW as the optimizer for training. All the experiments were performed on NVIDIA A6000 GPUs. Evaluation datasets: We extensively evaluate zero-shot accuracies on a suite of benchmarks considered in prior work (Radford et al., 2021; Wortsman et al., 2021): (a) ImageNet: a 1000-class image classification challenge (Russakovsky et al., 2015); (b) ImageNet-OOD: six associated ImageNet distribution shifts: ImageNet-V2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2020), ImageNet-A (Hendrycks et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-O (Hendrycks et al., 2019), and ObjectNet (Barbu et al., 2019); (c) VTAB: 12 datasets from the Visual Task Adaptation Benchmark (Zhai et al., 2020), including Caltech101, CIFAR100, DTD, Flowers102, Pets, SVHN, Resisc45, EuroSAT, PatchCamelyon, Clevr Counts, Clevr Distance, KITTI and Sun397; and (d) Retrieval: 3 retrieval tasks of MSCOCO (Chen et al., 2015), Flickr (Young et al., 2014) and WinoGAViL (Bitton et al., 2022). |
| Dataset Splits | No | The paper describes training on subsets of the LAION dataset and evaluating on external benchmark datasets (e.g., ImageNet, VTAB, DataComp, Imagenette). While it mentions using Conceptual Captions (CC3M) for fine-tuning a validation model, it does not specify explicit train/validation/test splits of its primary LAION-based data pools for the main experiments. |
| Hardware Specification | Yes | All the experiments were performed on NVIDIA A6000 GPUs. |
| Software Dependencies | No | The paper mentions using the DataComp library, OpenCLIP, and the MMOCR library, but it does not specify version numbers for these software components or any other key dependencies. |
| Experiment Setup | Yes | We train both ResNet-50 and ViT-B-32 models with a batch size of 1024, using a cosine learning rate schedule with 200 steps of warmup at 5e-4. We use AdamW as the optimizer for training. We train a randomly initialized ViT-B-32 vision encoder with a pre-trained RoBERTa text encoder for 120 steps of warmup followed by a cosine schedule with a maximum learning rate of 1e-3. The number of training steps is the same across all training runs (fixed at 600 steps at a batch size of 1024). A hedged sketch of this optimizer and schedule setup is given after the table, following the algorithm sketch. |
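
To make the quoted pseudocode concrete, here is a minimal, runnable Python sketch of the T-MARS filtering step (Algorithm 1). It is not the authors' released code: the `clip_score` and `mask_text` callables are assumed to be supplied by the user, e.g. a CLIP image-text similarity function and an OCR-based text-masking routine (the paper detects text with MMOCR).

```python
import numpy as np

def t_mars_filter(samples, clip_score, mask_text):
    """Minimal sketch of T-MARS filtering (Algorithm 1).

    samples    : list of (image, caption) pairs
    clip_score : callable (image, caption) -> float CLIP similarity
    mask_text  : callable image -> image with detected text regions masked out
    Returns the subset of samples whose similarity, computed on the
    text-masked image, is at or above the median masked-image score.
    """
    # Step 1: text-masking — remove text regions from every image
    masked_images = [mask_text(img) for img, _ in samples]

    # Step 2: re-scoring — CLIP similarity of masked image vs. original caption
    scores = np.array([clip_score(m_img, cap)
                       for m_img, (_, cap) in zip(masked_images, samples)])

    # Keep samples whose masked-image score clears the median threshold
    alpha = np.median(scores)
    return [pair for pair, s in zip(samples, scores) if s >= alpha]
```

Because the threshold is the median, this sketch retains half of the pool, matching the paper's description of filtering out samples whose caption is only supported by text rendered inside the image.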
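The reported training configuration (AdamW, batch size 1024, linear warmup followed by a cosine learning-rate schedule, peak learning rate 5e-4 with 200 warmup steps) can be sketched in PyTorch as below. This is an illustrative reconstruction of the stated hyperparameters, not the DataComp/OpenCLIP training code the paper actually uses; `model` and `total_steps` are placeholders.

```python
import math
import torch

def build_optimizer_and_scheduler(model, total_steps, warmup_steps=200, peak_lr=5e-4):
    """Sketch of the reported setup: AdamW with linear warmup, then cosine decay."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step):
        # Linear warmup for the first `warmup_steps` steps
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        # Cosine decay from the peak learning rate over the remaining steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

Calling `scheduler.step()` once per training batch reproduces the warmup-then-cosine learning-rate shape described in the quoted setup.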