DsDm: Model-Aware Dataset Selection with Datamodels

Authors: Logan Engstrom, Axel Feldmann, Aleksander Madry

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our resulting method greatly improves language model (LM) performance on both pre-specified tasks and previously unseen tasks. Specifically, choosing target tasks representative of standard LM problems and evaluating on diverse held-out benchmarks, our selected datasets provide a 2× compute multiplier over baseline methods.
Researcher Affiliation | Academia | Logan Engstrom¹, Axel Feldmann¹, Aleksander Madry¹; ¹MIT. Correspondence to: Logan Engstrom <engstrom@mit.edu>.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions adapting implementations for baseline methods (DSIR and CLASSIFIER) and using an existing implementation for SemDeDup, but does not provide a link or statement about open-sourcing the code for their proposed DsDm method.
Open Datasets | Yes | Our candidate dataset S is the English subset of the Colossal Cleaned Common Crawl (C4), a standard web scrape (Raffel et al., 2020). We consider four separate LM target tasks: LAMBADA (Paperno et al., 2016), CS-Algorithms (Srivastava et al., 2022), SQuAD (Rajpurkar et al., 2016), and Jeopardy (Tunguz, 2019). (See the dataset-loading sketch after the table.)
Dataset Splits | Yes | For each considered target task, we split samples into a target set and a separate test set, and only use the target set to select training subsets. Our target set is 25% of the SQuAD train set (23107 examples); our holdout set is the SQuAD validation set (10557 examples). We split the LAMBADA test set into separate target and holdout sets, then remove 6 samples from the LAMBADA holdout set due to overlap with samples in our candidate train dataset... We conclude with 2570 holdout samples and 2577 target samples. We randomly split the test set into 660 target samples and 660 holdout samples. (See the splitting sketch after the table.)
Hardware Specification | Yes | We train on A100s (with BF16 precision) and H100s (with FP8 precision).
Software Dependencies | No | The paper mentions using "LLM-Foundry (MosaicML, 2023)" and the "BPE tokenizer used by Pythia (Biderman et al., 2023)", but does not specify version numbers for these or other software libraries (e.g., Python or PyTorch versions).
Experiment Setup | Yes | We train GPT-2 family decoder-only transformer models using LLM-Foundry... We use ADAM (β1 = 0.9, β2 = 0.95, ϵ = 10⁻⁸), sequence length 1024, batch size 1024, a cosine learning rate schedule (with 200 warmup batches and α = 0.1), and ℓ2 gradient clipping with threshold 1. Table 2 summarizes hyperparameters (LR, WD, d_model, heads, layers, tokens, and batches) for the various model sizes. (See the training-configuration sketch after the table.)
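
For the Open Datasets row, a sketch (not the authors' pipeline) of loading the public candidate pool and one target task with Hugging Face `datasets`. The hub IDs and field names below are assumptions about current hub naming; CS-Algorithms (a BIG-bench task) and the Jeopardy set (Tunguz, 2019) ship separately and are omitted.

```python
# Sketch only: load the candidate pool (English C4) and the SQuAD target task
# with Hugging Face `datasets`. Hub IDs ("allenai/c4", "squad") are assumptions
# about current naming, not taken from the paper.
from datasets import load_dataset

# English C4 is very large, so stream it instead of downloading it outright.
c4_en = load_dataset("allenai/c4", "en", split="train", streaming=True)

# SQuAD already ships train/validation splits, matching the target/holdout
# usage quoted in the Dataset Splits row.
squad = load_dataset("squad")

print(next(iter(c4_en))["text"][:80])   # a candidate web-scrape document
print(squad["train"][0]["question"])    # a SQuAD target-task example
```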
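
For the Dataset Splits row, a minimal sketch of the kind of random target/holdout partition described there (e.g., the 660/660 Jeopardy split). The seed, shuffling procedure, and helper name `split_target_holdout` are illustrative assumptions, not details reported in the paper.

```python
# Sketch only: randomly partition a task's examples into a target set (used to
# select training data) and a disjoint holdout set (used only for evaluation).
# The 660/660 sizes follow the split quoted above; the seed is an assumption.
import random

def split_target_holdout(samples, n_target, seed=0):
    """Shuffle a copy of `samples` and split it into (target, holdout)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return shuffled[:n_target], shuffled[n_target:]

samples = [f"question_{i}" for i in range(1320)]   # placeholder examples
target, holdout = split_target_holdout(samples, n_target=660)
assert len(target) == 660 and len(holdout) == 660
```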
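
For the Experiment Setup row, a minimal plain-PyTorch sketch of the reported optimizer and schedule settings. The model, peak learning rate, weight decay, and total batch count are placeholders (per-size values are in the paper's Table 2), decoupled weight decay (AdamW) is an assumption about how the reported WD is applied, and the authors train with LLM-Foundry rather than a hand-rolled loop.

```python
# Sketch only: Adam-style optimizer (beta1=0.9, beta2=0.95, eps=1e-8), cosine
# LR schedule with 200 warmup batches decaying to alpha * peak LR (alpha=0.1),
# and l2 gradient clipping at 1.0, as quoted in the Experiment Setup row.
import math
import torch

model = torch.nn.Linear(1024, 1024)       # stand-in for the GPT-2-family model
peak_lr, weight_decay = 3e-4, 1e-4        # placeholders; per-size values are in Table 2
warmup_batches, total_batches = 200, 10_000
alpha = 0.1                               # final LR as a fraction of the peak LR

# AdamW (decoupled weight decay) is an assumption about how WD is applied.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), eps=1e-8,
    weight_decay=weight_decay,
)

def lr_multiplier(step: int) -> float:
    """Linear warmup for `warmup_batches`, then cosine decay to alpha * peak."""
    if step < warmup_batches:
        return step / max(1, warmup_batches)
    progress = (step - warmup_batches) / max(1, total_batches - warmup_batches)
    return alpha + (1 - alpha) * 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)

# In the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```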