Effective pruning of web-scale datasets based on complexity of concept clusters
Authors: Amro Kamal Mohamed Abbas, Evgenia Rusak, Kushal Tirumala, Wieland Brendel, Kamalika Chaudhuri, Ari S. Morcos
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report results on three different datasets: 1. LAION-CAT-440M: (Radenovic et al., 2023) proposed a caption complexity, action, and text spotting filtering (CAT) method and filtered the LAION-2B dataset to 440M examples (LAION-CAT-440M). We use SemDeDup (Abbas et al., 2023) to reduce the size of this dataset to 280 million examples, and call it LAION-DeDup-280M. We refer the reader to (Radenovic et al., 2023) for more details about the LAION-CAT-440M dataset. 2. LAION-50M: a random subset from LAION-DeDup-280M. We use this dataset mainly for development and hyperparameter search. 3. DataComp Medium dataset (Gadre et al., 2023): Since the LAION-CAT-440M dataset has already been pre-filtered in multiple ways, we complement our results on LAION by using a raw dataset with no filtering applied to it. We choose to use the DataComp Medium dataset, which consists of 128 million raw examples. Because of link failures we were able to download 120 million examples from DataComp. Pruning the LAION dataset: For all experiments on LAION, we focus on the training cost we save. Thus, we follow a fixed and simple setting of filtering the dataset to 60% of its original size after deduplication. Therefore, we prune LAION-DeDup-280M and LAION-50M to 166M and 30M examples, respectively. For LAION-DeDup-280M, we also experiment with pruning to 28% and 40% of its original size. Unless stated otherwise, we train for 32 epochs. For our Density-Based Pruning method, we use image embeddings from a distilled DINOv2-L/14 model (Oquab et al., 2023). We find that using the distilled DINOv2-L/14 embeddings works better than using multimodal embeddings, as discussed in Section 5. We tune the number of clusters for k-means on LAION-DeDup-280M and use k=500 (see Section 5.4). (An illustrative sketch of this clustering stage is given after the table.) |
| Researcher Affiliation | Collaboration | University of Tübingen, Germany; Meta AI (FAIR); ELLIS Institute Tübingen; Max-Planck Institute for Intelligent Systems; Tübingen AI Center; University of California San Diego; Datology AI |
| Pseudocode | Yes | Table 4: Python code for the quadratic program solver. import numpy as np; import torch; from qpsolvers import solve_qp. # Input: d_inter (List), d_intra (List), temp (float), num_centroids (int), filtered_dataset_size (int), num_items_in_each_cluster (List). # Output: X (list), the number of samples per cluster. softmax = torch.nn.Softmax(); probs = softmax((d_inter * d_intra) / temp); P = np.eye(num_centroids); q = probs * filtered_dataset_size; A = np.array([1.0] * num_centroids); b = np.array([filtered_dataset_size]); # Define the lower and upper bounds. min_samples = 1; bounds = np.array([(min_samples, num_items_in_each_cluster[i]) for i in range(num_centroids)]); X = solve_qp(P=P, q=q, A=A, b=b, lb=bounds[:, 0], ub=bounds[:, 1], solver="osqp"). (A runnable, self-contained version of this listing is sketched after the table.) |
| Open Source Code | Yes | Code at github.com/amro-kamal/effective_pruning. |
| Open Datasets | Yes | 1. LAION-CAT-440M: (Radenovic et al., 2023) proposed a caption complexity, action, and text spotting filtering (CAT) method and filtered the LAION-2B dataset to 440M examples (LAION-CAT-440M). 2. LAION-50M: a random subset from LAION-DeDup-280M. We use this dataset mainly for development and hyperparameter search. 3. DataComp Medium dataset (Gadre et al., 2023): Since the LAION-CAT-440M dataset has already been pre-filtered in multiple ways, we complement our results on LAION by using a raw dataset with no filtering applied to it. We choose to use the DataComp Medium dataset, which consists of 128 million raw examples. Because of link failures we were able to download 120 million examples from DataComp. |
| Dataset Splits | Yes | Evaluation: We use zero-shot accuracy for all evaluations and report top-1 zero-shot accuracy on ImageNet. In addition, we follow the DataComp evaluation protocol and evaluate on a suite of 38 image classification and retrieval tasks, including the VTAB tasks (Zhai et al., 2019b), ImageNet distribution-shift tasks, and retrieval tasks. All the evaluation datasets we use are listed in Table 10. (A minimal zero-shot evaluation sketch is given after the table.) |
| Hardware Specification | No | The paper mentions hardware only in the context of prior work: "For example, training of the ViT-L/14 model (Dosovitskiy et al., 2020) with OpenCLIP took 400 A100 (40 GB) GPUs for around 127 hours." However, it does not specify the hardware used for *their own* experiments. |
| Software Dependencies | Yes | We use different open-source software packages for our experiments, most notably SLURM (Yoo et al., 2003), OpenCLIP (Ilharco et al., 2021), scipy and numpy (Virtanen et al., 2020), GNU Parallel (Tange, 2011), Faiss (Johnson et al., 2019), PyTorch (Paszke et al., 2017), and torchvision (Marcel & Rodriguez, 2010). |
| Experiment Setup | Yes | Other hyperparameters: We train the CLIP ViT-B/32 models using the OpenCLIP (Ilharco et al., 2021) default hyperparameters for both the LAION and DataComp datasets and fix the training seed. We list the values of different hyperparameters in Table 6, Appendix E. Table 6 (training parameters for CLIP; we follow the standard hyperparameters for each dataset, i.e., the OpenCLIP hyperparameters for experiments on LAION and the DataComp hyperparameters for experiments on DataComp Medium): Model: CLIP ViT-B-32; Warmup (LAION): 2000 training steps; Warmup (DataComp): 500 training steps; Batch size (LAION): 33,792; Batch size (DataComp): 4,096; Learning rate: 5.0e-4 with cosine scheduler; Optimizer: AdamW (wd=0.2, betas=(0.9, 0.98), eps=1.0e-6). |
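
The Density-Based Pruning pipeline quoted in the Research Type row clusters image embeddings with k-means (k = 500, distilled DINOv2-L/14 features) and derives per-cluster distance statistics that feed the quadratic program below. The following is a minimal sketch of that clustering stage using Faiss with random stand-in embeddings; the specific definitions of d_intra (mean member-to-centroid distance) and d_inter (distance to the nearest other centroid) are illustrative assumptions, not necessarily the paper's exact formulas.

```python
import faiss
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for the image embeddings (the paper uses distilled DINOv2-L/14 features).
embeddings = rng.standard_normal((50_000, 256)).astype("float32")
faiss.normalize_L2(embeddings)

k = 500  # number of k-means clusters used in the paper for LAION-DeDup-280M
kmeans = faiss.Kmeans(d=embeddings.shape[1], k=k, niter=20, seed=0)
kmeans.train(embeddings)

# Assign every embedding to its nearest centroid; Faiss returns squared L2 distances.
sq_dists, assign = kmeans.index.search(embeddings, 1)
assign = assign.ravel()
point_dist = np.sqrt(sq_dists.ravel())

# d_intra (assumption): mean distance of a cluster's members to their centroid.
d_intra = np.array([point_dist[assign == c].mean() if np.any(assign == c) else 0.0
                    for c in range(k)])

# d_inter (assumption): distance from each centroid to its nearest other centroid.
centroid_index = faiss.IndexFlatL2(kmeans.centroids.shape[1])
centroid_index.add(kmeans.centroids)
cd, _ = centroid_index.search(kmeans.centroids, 2)  # position 0 is the centroid itself
d_inter = np.sqrt(cd[:, 1])

# Per-cluster capacities, used as upper bounds in the quadratic program.
num_items_in_each_cluster = np.bincount(assign, minlength=k)
```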
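
The Pseudocode row reproduces the paper's Table 4 listing, which is garbled by PDF extraction. Below is a self-contained, runnable reconstruction using the qpsolvers package; the toy values for d_inter, d_intra, the temperature, and the cluster sizes are made up for illustration, and only the QP setup mirrors the listing.

```python
import numpy as np
import torch
from qpsolvers import solve_qp  # pip install qpsolvers osqp

rng = np.random.default_rng(0)
num_centroids = 50              # the paper uses k = 500; kept small here
filtered_dataset_size = 10_000  # total number of examples to keep (toy value)
temp = 0.1                      # softmax temperature (toy value)

# Toy per-cluster statistics standing in for the outputs of the clustering stage.
d_inter = torch.tensor(rng.uniform(0.1, 1.0, num_centroids), dtype=torch.float32)
d_intra = torch.tensor(rng.uniform(0.1, 1.0, num_centroids), dtype=torch.float32)
num_items_in_each_cluster = rng.integers(100, 1_000, num_centroids)

# Complexity-weighted distribution over clusters, as in the listing.
probs = torch.softmax((d_inter * d_intra) / temp, dim=0).numpy().astype(np.float64)

# QP data as in the listing. Note that qpsolvers minimizes 1/2 x^T P x + q^T x, so
# using q = -probs * filtered_dataset_size instead would pull the solution toward the
# target allocation probs * filtered_dataset_size; the sign here follows the extracted listing.
P = np.eye(num_centroids)
q = probs * filtered_dataset_size
A = np.ones((1, num_centroids))               # equality constraint: the per-cluster counts...
b = np.array([float(filtered_dataset_size)])  # ...must sum to the pruned dataset size

# Box constraints: at least one sample per cluster, at most the cluster's size.
min_samples = 1
lb = np.full(num_centroids, float(min_samples))
ub = num_items_in_each_cluster.astype(np.float64)

X = solve_qp(P=P, q=q, A=A, b=b, lb=lb, ub=ub, solver="osqp")
print("samples per cluster:", np.round(X).astype(int))
```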
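
The Dataset Splits row describes the zero-shot evaluation protocol. As a minimal sketch of top-1 zero-shot classification with OpenCLIP (the library the paper trains and evaluates with), the snippet below scores a single image against prompt-embedded class names; the pretrained tag, image path, and two-class label set are placeholders rather than the paper's own checkpoints or the 38-task DataComp suite.

```python
import torch
from PIL import Image
import open_clip

# Placeholder checkpoint; the paper evaluates its own ViT-B/32 models trained on pruned data.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["golden retriever", "tabby cat"]             # placeholder label set
text = tokenizer([f"a photo of a {name}" for name in class_names])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Top-1 zero-shot prediction: the class whose text embedding is most similar to the image.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```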