Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Authors: Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, Hervé Jégou, Patrick Labatut, Piotr Bojanowski

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on three different data domains including web-based images, satellite images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par with or better than ones trained on manually curated data.
Researcher Affiliation | Collaboration | 1 Meta Fundamental AI Research (FAIR); 2 INRIA; 3 Université Paris Saclay; 4 Google
Pseudocode | Yes | Algorithm 1: Hierarchical k-means with resampling.
Open Source Code | Yes | Our code is publicly available at https://github.com/facebookresearch/ssl-data-curation.
Open Datasets | Yes | We apply our curation algorithm to a pool of web-based images. It is assembled by following links from <img> tags in web pages of a publicly available repository of crawled web data. ... We train a ViT-L with DINOv2 on ImageNet-1k (Russakovsky et al., 2015) and use it as our base feature extractor. ... The first installment of Llama (Touvron et al., 2023) was trained on a mix of datasets, some of which are high-quality datasets from narrow domains, while most of the training data was a variant of CCNet (Wenzek et al., 2019), a heuristic Wikipedia-based curation applied to text from Common Crawl. ... Finally, we evaluate feature fairness on the Dollar Street dataset (De Vries et al., 2019).
Dataset Splits | Yes | We report top-1 accuracy on k-NN and linear classification on the 1000 classes of ImageNet. Apart from the standard validation set, we also consider alternative test sets ImageNet-V2 (Recht et al., 2019) and ImageNet-ReaL (Beyer et al., 2020). ... We artificially generate unbalanced variants of ImageNet by resampling this dataset such that the class sizes follow a power law with the scaling exponent α taken in {0.5, 1, 2}. ... For our experiments, we build a raw pool of 18 million images in a similar manner to Tolan et al. (2023). ... We then train a DINOv2-reg ViT-L on both the curated dataset and the raw data pool, and evaluate the canopy height estimators trained with these two backbones. We follow the same evaluation protocol as Tolan et al. (2023) and report the Mean Average Error (MAE) and block R2 (r2) metrics on the test sets. These include the NEON test set, which contains images from sites not present in the decoder's training data; the California Brande dataset (Brande, 2021); the São Paulo dataset (dos Santos et al., 2019), which contains much taller trees than those in NEON; and the Aerial NEON test set, which contains images acquired by drones instead of satellites.
Hardware Specification | No | In order to run k-means and k-means++ initialization at a large scale, we implement a distributed GPU-supported version of this algorithm in PyTorch (Paszke et al., 2019).
Software Dependencies | No | In order to run k-means and k-means++ initialization at a large scale, we implement a distributed GPU-supported version of this algorithm in PyTorch (Paszke et al., 2019). Our main run involves a 4-level hierarchical k-means on this image pool with 10M, 500k, 50k and 10k clusters in the first, second, third and fourth levels. ... We use the all-mpnet-base-v2 model from SBERT (Reimers & Gurevych, 2019) to represent documents.
Experiment Setup | Yes | We train a ViT-L with DINOv2 on ImageNet-1k (Russakovsky et al., 2015) and use it as our base feature extractor. ... Our main run involves a 4-level hierarchical k-means on this image pool with 10M, 500k, 50k and 10k clusters in the first, second, third and fourth levels. ... We perform all our ablation studies with a ViT-L. We use the original training recipe from DINOv2 (Oquab et al., 2023) in all our experiments, except for a smaller learning rate of 5 × 10⁻⁵ for ViT-g. ... We train a language model with 7B parameters on a schedule for 210B tokens following Touvron et al. (2023).
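
The curation method quoted above centers on hierarchical k-means with resampling (Algorithm 1 in the paper), run as a distributed GPU implementation at a much larger scale. The following is only a minimal single-machine sketch of the two ideas visible in the table: cluster the data, cluster the resulting centroids to get a coarser level, then draw a roughly equal number of points from each top-level cluster to form a balanced curated subset. The function names, the two-level depth, and the `per_cluster` parameter are illustrative assumptions, not the paper's API, and the paper's resampling step between k-means runs is omitted here.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # squared Euclidean distance of every point to every centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centroids[j] = pts.mean(0)
    return centroids, assign

def hierarchical_kmeans_curation(X, ks, per_cluster, seed=0):
    """Two-level hierarchical clustering + balanced sampling (sketch).

    Level 1 clusters the points; level 2 clusters the level-1 centroids.
    We then sample up to `per_cluster` points from each top-level cluster,
    so dense regions of feature space are downsampled and rare regions kept.
    """
    rng = np.random.default_rng(seed)
    c1, a1 = kmeans(X, ks[0], seed=seed)   # level 1: points -> fine clusters
    _, a2 = kmeans(c1, ks[1], seed=seed)   # level 2: cluster the centroids
    top = a2[a1]                           # top-level label of each point
    curated = []
    for j in range(ks[1]):
        idx = np.flatnonzero(top == j)
        if len(idx):
            take = min(per_cluster, len(idx))
            curated.extend(rng.choice(idx, size=take, replace=False))
    return np.array(curated)
```

In the paper's main image run this hierarchy has four levels (10M, 500k, 50k and 10k clusters) and operates on DINOv2 features rather than raw pixels.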
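
The unbalanced-ImageNet ablation resamples the dataset so class sizes follow a power law with scaling exponent α ∈ {0.5, 1, 2}. Below is a hedged sketch of one plausible reading, assuming the class at size-rank r keeps a share proportional to r^(−α) of the largest class's size; the random rank assignment and the rounding are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def power_law_resample(labels, alpha, seed=0):
    """Return indices of a subset whose class sizes follow ~ r^(-alpha).

    `labels` is an integer class label per sample. Classes are assigned
    ranks r = 1..C at random; the rank-r class keeps about
    max_count * r^(-alpha) samples (at least one per class).
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    ranks = np.arange(1, len(classes) + 1, dtype=float)
    # target sizes, scaled so the rank-1 class keeps the largest class size
    target = ranks ** (-alpha) * counts.max()
    order = rng.permutation(len(classes))  # random class -> rank assignment
    keep = []
    for rank, ci in enumerate(order):
        idx = np.flatnonzero(labels == classes[ci])
        n = min(len(idx), int(round(target[rank])))
        keep.extend(rng.choice(idx, size=max(n, 1), replace=False))
    return np.sort(np.array(keep))
```

Larger α gives a steeper imbalance: with α = 2 the second-largest class is already a quarter of the largest, which matches the paper's intent of stress-testing curation on long-tailed data.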