Data Selection for Language Models via Importance Resampling

Authors: Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy S. Liang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose Data Selection with Importance Resampling (DSIR), an efficient and scalable framework that estimates importance weights in a reduced feature space for tractability and selects data with importance resampling according to these weights. We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents from the full Pile dataset in 4.5 hours. To measure whether hashed n-gram features preserve the aspects of the data that are relevant to the target, we define KL reduction, a data metric that measures the proximity between the selected pretraining data and the target on some feature space. Across 8 data selection methods (including expert selection), KL reduction on hashed n-gram features highly correlates with average downstream accuracy (r = 0.82). When selecting data for continued pretraining on a specific domain, DSIR performs comparably to expert curation across 8 target distributions. When pretraining general-domain models (target is Wikipedia and books), DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark. (Illustrative sketches of the importance-resampling procedure and of the KL reduction metric are given after the table.)
Researcher Affiliation | Academia | Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy Liang, Department of Computer Science, Stanford University, {xie, shibani, tengyuma, pliang}@cs.stanford.edu
Pseudocode | No | The paper describes the steps of its framework in prose and lists steps in the appendices, but it does not include formally labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | Code, selected data, and pretrained models are available at https://github.com/p-lambda/dsir.
Open Datasets | Yes | We consider selecting data from The Pile (1.6B examples) for continued pretraining of domain-specific LMs and training general-domain LMs from scratch. On 8 datasets from 4 domains (CS papers, biomedical papers, news, reviews), DSIR improves over RoBERTa (no continued pretraining) by 2% on average and is even comparable to continued pretraining on expert-curated data from Gururangan et al. [24]. For general-domain LMs (Section 7), the data selection target is formal, clean text from Wikipedia and books, following GPT-3 [10].
Dataset Splits | Yes | We reserve chunk 0 for validation purposes and only consider the last 29 chunks. We train the importance weight estimator or fasttext classifier from The Pile validation set, where the target is Wikipedia + BookCorpus2 + Gutenberg + Books3 and the raw data come from the rest of the data sources in The Pile.
Hardware Specification | Yes | DSIR selects 100M documents from the full Pile dataset in 4.5 hours on 1 CPU node with 96 cores. GPUs: 4 Titan RTX (Tables 6, 7, 10, 11).
Software Dependencies | No | The paper mentions software components like 'fasttext' and uses specific model architectures (e.g., 'BERT-base', 'RoBERTa-base'), but it does not provide specific version numbers for software libraries or frameworks (e.g., PyTorch, NLTK).
Experiment Setup | Yes | Table 6: Hyperparameters for training general-domain LMs from scratch. Table 7: Hyperparameters for continued pretraining of general-domain LMs. Table 8: Dataset-specific hyperparameters for fine-tuning LMs on GLUE. Table 9: Shared hyperparameters for fine-tuning LMs on GLUE. Table 10: Hyperparameters for continued pretraining on domain-specific data. Table 11: Hyperparameters for fine-tuning on domain-specific data.
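
As a rough illustration of the DSIR recipe quoted in the Research Type row, the sketch below hashes unigrams and bigrams into a fixed number of buckets, fits bag-of-n-grams distributions for the target and the raw pool, scores each raw example with a log importance weight, and resamples with the Gumbel top-k trick. All names, the 10,000-bucket size, and the smoothing choice are illustrative assumptions; this is not the API of the released p-lambda/dsir package.

```python
"""Minimal sketch of hashed n-gram importance resampling (assumed details)."""
import hashlib
import numpy as np

NUM_BUCKETS = 10_000  # assumed bucket count for hashed unigram + bigram features


def hashed_ngram_features(text, num_buckets=NUM_BUCKETS):
    """Count unigrams and bigrams of a document, hashed into fixed buckets."""
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = np.zeros(num_buckets)
    for ng in ngrams:
        bucket = int(hashlib.md5(ng.encode()).hexdigest(), 16) % num_buckets
        counts[bucket] += 1
    return counts


def fit_bag_of_ngrams(texts, smoothing=1.0):
    """Fit a smoothed categorical distribution over hashed n-gram buckets."""
    total = np.full(NUM_BUCKETS, smoothing)
    for text in texts:
        total += hashed_ngram_features(text)
    return total / total.sum()


def dsir_select(raw_texts, target_texts, k, seed=0):
    """Pick k raw examples by importance resampling toward the target."""
    rng = np.random.default_rng(seed)
    log_p = np.log(fit_bag_of_ngrams(target_texts))  # target feature distribution
    log_q = np.log(fit_bag_of_ngrams(raw_texts))     # raw feature distribution
    # Log importance weight of each raw example under the bag-of-n-grams model.
    log_w = np.array([hashed_ngram_features(t) @ (log_p - log_q) for t in raw_texts])
    # Gumbel top-k trick: sample k examples without replacement, with
    # probability proportional to their importance weights.
    gumbel_noise = rng.gumbel(size=len(raw_texts))
    return np.argsort(-(log_w + gumbel_noise))[:k]
```

This single-threaded sketch only conveys the structure of the computation; the reported 4.5-hour selection of 100M documents (Hardware Specification row) relies on parallelizing featurization and weighting across the 96 CPU cores of one node.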
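The abstract also cites KL reduction as the metric that correlates with downstream accuracy (r = 0.82). Under one natural reading of that metric, namely how much the selected data reduces the KL divergence to the target's hashed n-gram distribution relative to the raw pool, it could be computed as below. This reuses fit_bag_of_ngrams from the previous sketch; the exact definition should be checked against the paper.

```python
"""Sketch of KL reduction on the hashed n-gram feature space (assumed definition)."""
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) between two smoothed distributions over hashed buckets."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def kl_reduction(target_texts, raw_texts, selected_texts):
    """Assumed form: KL(target || raw) - KL(target || selected)."""
    p_target = fit_bag_of_ngrams(target_texts)
    p_raw = fit_bag_of_ngrams(raw_texts)
    p_selected = fit_bag_of_ngrams(selected_texts)
    return kl_divergence(p_target, p_raw) - kl_divergence(p_target, p_selected)
```

A larger value would indicate that the selected data sits closer to the target than a random draw from the raw pool does, which is the sense in which the quoted text says the metric tracks average downstream accuracy across the 8 data selection methods.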