T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

Authors: Pratyush Maini, Sachin Goyal, Zachary Chase Lipton, J Zico Kolter, Aditi Raghunathan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, T-MARS is the top-ranked approach on ImageNet at the medium scale of DataComp (a data filtering benchmark), and outperforms CLIP filtering by a margin of 6.5% on ImageNet and 4.7% on VTAB.
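The zero-shot accuracies behind these numbers are computed in the standard CLIP fashion: each class name is embedded via a text prompt, and an image is assigned the class whose text embedding is most similar. A minimal sketch, assuming hypothetical `encode_image` / `encode_text` functions that return L2-normalized embeddings; this illustrates the evaluation protocol, not the authors' evaluation code:

```python
# Minimal sketch of CLIP-style zero-shot classification, as used for the
# ImageNet / VTAB evaluations. `encode_image` and `encode_text` are
# hypothetical stand-ins for a trained model's encoders; both are assumed
# to return L2-normalized vectors, so a dot product is cosine similarity.
import numpy as np

def zero_shot_accuracy(images, labels, class_names, encode_image, encode_text):
    # Embed one prompt per class once, up front.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = np.stack([encode_text(p) for p in prompts])  # (C, d)
    correct = 0
    for img, label in zip(images, labels):
        img_emb = encode_image(img)       # (d,)
        sims = text_emb @ img_emb         # similarity to each class prompt
        correct += int(np.argmax(sims) == label)
    return correct / len(images)
```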
Researcher Affiliation | Collaboration | Pratyush Maini, Sachin Goyal, Zachary C. Lipton, J. Zico Kolter, Aditi Raghunathan; Carnegie Mellon University and Bosch Center for AI. {pratyushmaini,sachingoyal,zlipton,zkolter,raditi}@cmu.edu
Pseudocode | Yes | The paper provides Algorithm 1:
Algorithm 1: T-MARS
Input: dataset S = {(i_k, t_k)}_{k=1}^{n}, score function ℓ, image masking function m
Output: filtered pool S̄
// Step 1: Text-Masking
for k = 0 ... n−1 do
    ī_k = m(i_k)
end for
// Step 2: Re-Scoring
for k = 0 ... n−1 do
    s_k = ℓ(ī_k, t_k)
end for
α = Median({s_k}_{k=1}^{n})
return S̄ = {(i_k, t_k) | s_k ≥ α}
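In code, the algorithm masks detected text regions in every image, re-scores each masked image against its caption, and keeps the pairs scoring at or above the median. A minimal sketch of the filtering loop, assuming hypothetical `mask_text` (e.g., an OCR-based masking step) and `clip_score` (image-text similarity) callables; an illustration, not the authors' implementation:

```python
# Minimal sketch of the T-MARS filtering loop (Algorithm 1).
# `mask_text` and `clip_score` are hypothetical helper callables.
from statistics import median

def t_mars_filter(dataset, mask_text, clip_score):
    """dataset: list of (image, caption) pairs; returns the filtered pool."""
    # Step 1: mask detected text regions in every image
    masked = [mask_text(img) for img, _ in dataset]
    # Step 2: re-score each pair using the masked image
    scores = [clip_score(m_img, cap) for m_img, (_, cap) in zip(masked, dataset)]
    # Keep pairs whose masked-image score is at least the median
    alpha = median(scores)
    return [pair for pair, s in zip(dataset, scores) if s >= alpha]
```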
Open Source Code | No | The paper thanks the DataComp and OpenCLIP teams for the codebases used for training, but it does not provide an explicit statement or link for the open-source code of its own method, T-MARS.
Open Datasets | Yes | We first experiment on six different data pools ranging from 2M to 64M samples chosen from the LAION-400M dataset. Note that the compute budget (total training samples seen, i.e., epochs) is kept the same as the pool size. For example, for a 32M pool size, the total samples which can be seen during training is kept at 32M (i.e., 1 epoch over the whole dataset). In cases where filtering methods retain a smaller subset (say 16M samples) of the data pool, they get the advantage of running more iterations (2 epochs over the 16M subset, i.e., 32M total samples seen) over the chosen subset. Finally, we also experiment on the 12.8M (small scale) and 128M (medium scale) data pools of the recently released DataComp. We use the implementation of the DataComp library to standardize the training process. We train both ResNet-50 and ViT-B-32 models with a batch size of 1024, using a cosine learning rate schedule with 200 steps of warmup at 5e-4. We use AdamW as the optimizer for training. All the experiments were performed on NVIDIA A6000 GPUs. (Section 5.2, Evaluation Datasets:) We extensively evaluate zero-shot accuracies on a suite of benchmarks considered in prior work (Radford et al., 2021; Wortsman et al., 2021): (a) ImageNet: a 1000-class image classification challenge (Russakovsky et al., 2015); (b) ImageNet-OOD: six associated ImageNet distribution shifts, ImageNet-V2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2020), ImageNet-A (Hendrycks et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-O (Hendrycks et al., 2019), and ObjectNet (Barbu et al., 2019); (c) VTAB: 12 datasets from the Visual Task Adaptation Benchmark (Zhai et al., 2020), including Caltech101, CIFAR100, DTD, Flowers102, Pets, SVHN, Resisc45, EuroSAT, PatchCamelyon, CLEVR Counts, CLEVR Distance, KITTI, and SUN397; and (d) Retrieval: three retrieval tasks, MSCOCO (Chen et al., 2015), Flickr (Young et al., 2014), and WinoGAViL (Bitton et al., 2022).
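The fixed-compute protocol quoted above reduces to simple arithmetic: the number of epochs over a filtered subset is the pool-size budget divided by the subset size. A small illustrative helper (hypothetical, restating only the arithmetic from the quote):

```python
# Illustrative only: epochs seen under DataComp-style fixed compute,
# where the total-samples budget equals the original pool size.
def epochs_over_subset(pool_size: int, subset_size: int) -> float:
    """Total samples seen is fixed at pool_size, so a smaller
    filtered subset is simply revisited more often."""
    return pool_size / subset_size

assert epochs_over_subset(32_000_000, 32_000_000) == 1.0  # no filtering
assert epochs_over_subset(32_000_000, 16_000_000) == 2.0  # half retained
```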
Dataset Splits | No | The paper describes training on subsets of the LAION dataset and evaluating on external benchmark datasets (e.g., ImageNet, VTAB, DataComp, Imagenette). While it mentions using Conceptual Captions (CC3M) for fine-tuning a validation model, it does not specify explicit train/validation/test splits of its primary LAION-based data pools for the main experiments.
Hardware Specification | Yes | All the experiments were performed on NVIDIA A6000 GPUs.
Software Dependencies | No | The paper mentions using the 'DataComp library', 'OpenCLIP', and the 'MMOCR library', but it does not specify version numbers for these software components or any other key dependencies.
Experiment Setup | Yes | We train both ResNet-50 and ViT-B-32 models with a batch size of 1024, using a cosine learning rate schedule with 200 steps of warmup at 5e-4. We use AdamW as the optimizer for training. We train a randomly initialized ViT-B-32 vision encoder with a pre-trained RoBERTa text encoder for 120 steps of warmup followed by a cosine schedule with a maximum learning rate of 1e-3. The number of training steps is the same across all training runs (fixed at 600 steps at a batch size of 1024).
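A minimal PyTorch sketch of the quoted optimizer recipe follows. The peak learning rate and warmup steps come from the quote; the exact schedule shape (linear warmup into a cosine decay to zero) is a common default and an assumption here, not the authors' code:

```python
# Sketch of the quoted recipe: AdamW with linear warmup into cosine decay.
# `model` is any torch.nn.Module; numbers other than the defaults below
# are the caller's choice.
import math
import torch

def make_optimizer_and_scheduler(model, total_steps, peak_lr=5e-4, warmup_steps=200):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warmup to peak_lr
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

For the 600-step proxy runs quoted above, one would pass total_steps=600 (and warmup_steps=120 with peak_lr=1e-3 for the RoBERTa-text-encoder variant).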