Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Authors: Hadi Pouransari, Chun-Liang Li, Jen-Hao Chang, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Oncel Tuzel
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks, reaching target accuracy with up to 6× faster training compared to the baseline. |
| Researcher Affiliation | Industry | Hadi Pouransari¹, Chun-Liang Li¹, Jen-Hao Rick Chang¹, Pavan Kumar Anasosalu Vasu¹, Cem Koc¹, Vaishaal Shankar², Oncel Tuzel¹ (¹Apple, ²Anthropic) |
| Pseudocode | Yes | Algorithm 1: Length-based sampling and curriculum (a hedged Python sketch of this scheme is given below the table) |
| Open Source Code | Yes | Corresponding author: mpouransari@apple.com. Work done while at Apple. *Code to be available at https://github.com/apple/ml-dataset-decomposition. |
| Open Datasets | Yes | For all experiments, except the results in Section 3.5, we use RefinedWeb [46] filtering of Common Crawl [1] with a total of 525 billion tokens using the EleutherAI/gpt-neox [9] tokenizer (vocabulary size is 50,432). |
| Dataset Splits | No | The paper reports accuracy on standard evaluation benchmarks but does not explicitly state split percentages or sample counts for training, validation, and test sets. |
| Hardware Specification | Yes | Software and hardware details: All experiments in this paper are conducted using the OpenLM repository, which is based on PyTorch. We use Fully Sharded Data Parallelism (FSDP) with bfloat16 mixed precision for all experiments. We use the xFormers [29] implementation for attention. For hardware, we use one or more nodes of 8 NVIDIA H100 GPUs (Hopper architecture), each with 80GB memory, and 192 CPU cores with 2000GB of RAM. Nodes are connected through Elastic Fabric Adapter (EFA) for efficient inter-node communication, hosted on AWS. (A minimal FSDP/bf16 setup sketch is given below the table.) |
| Software Dependencies | No | The paper mentions PyTorch, FSDP, and xFormers [29], but does not provide specific version numbers for these software dependencies, which would be necessary for full reproducibility. |
| Experiment Setup | Yes | Baseline hyperparameters: We list our baseline hyperparameters in Table 12 and iterate over changes for each section next. Note that we did not explicitly optimize hyperparameters for any of the experiments, and we always use the same hyperparameters when using either the baseline method or ours. |
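
The paper's Algorithm 1 (length-based sampling and curriculum) is referenced in the table but not reproduced here. The sketch below is a minimal Python illustration of the general idea described in the paper: documents are decomposed into power-of-two-length chunks, chunks are grouped into buckets of equal sequence length, and each optimization step draws an entire batch from one bucket so the token budget per step stays constant. The function names (`decompose`, `build_buckets`, `length_curriculum_batches`), the linear short-to-long weighting, and the default constants are illustrative assumptions, not the paper's exact Algorithm 1 or curriculum schedule.

```python
import random
from collections import defaultdict

def decompose(document_tokens, max_pow=13):
    """Split one tokenized document into chunks whose lengths are powers of two.

    The document is cut greedily from the largest allowed power of two downward,
    so every chunk lands in a bucket of sequences with identical length
    (up to 2**max_pow = 8192 tokens here).
    """
    chunks, start = [], 0
    remaining = len(document_tokens)
    while remaining > 0:
        size = 1 << min(max_pow, remaining.bit_length() - 1)  # largest power of two <= remaining
        chunks.append(document_tokens[start:start + size])
        start += size
        remaining -= size
    return chunks

def build_buckets(corpus, max_pow=13):
    """Group all chunks by log2(length): buckets[i] holds sequences of length 2**i."""
    buckets = defaultdict(list)
    for doc in corpus:
        for chunk in decompose(doc, max_pow):
            buckets[len(chunk).bit_length() - 1].append(chunk)
    return buckets

def length_curriculum_batches(buckets, tokens_per_batch=2**14, num_steps=1000):
    """Yield fixed-token-budget batches, shifting probability mass from short to
    long sequence lengths as training progresses (a simple linear curriculum)."""
    levels = sorted(buckets)
    for step in range(num_steps):
        progress = step / max(1, num_steps - 1)
        weights = [(1 - progress) * (levels[-1] - i + 1) + progress * (i + 1) for i in levels]
        level = random.choices(levels, weights=weights, k=1)[0]
        seq_len = 1 << level
        batch_size = tokens_per_batch // seq_len  # constant token count per optimization step
        # Sampling with replacement for brevity; a real loader would iterate without replacement.
        yield [random.choice(buckets[level]) for _ in range(batch_size)]
```

Because every batch contains sequences of a single length, padding is unnecessary and the per-step token count is identical across buckets, which is what allows longer-context batches at no extra cost relative to the baseline.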
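The hardware row mentions FSDP with bfloat16 mixed precision but gives no configuration. The snippet below is a minimal sketch of how such a setup can be expressed with PyTorch's public FSDP API; the helper name `wrap_with_bf16_fsdp` and the placeholder model are assumptions for illustration, not the OpenLM training code used by the authors.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def wrap_with_bf16_fsdp(model: nn.Module) -> FSDP:
    """Shard a model across ranks with a bfloat16 mixed-precision policy.

    Assumes the process was launched with torchrun, which sets the
    RANK / WORLD_SIZE / MASTER_ADDR environment variables used by NCCL.
    """
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    bf16 = MixedPrecision(
        param_dtype=torch.bfloat16,   # keep parameters in bf16 during compute
        reduce_dtype=torch.bfloat16,  # all-reduce gradients in bf16
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(model.cuda(), mixed_precision=bf16)

if __name__ == "__main__":
    # Placeholder model for illustration; the paper trains OpenLM transformers instead.
    model = wrap_with_bf16_fsdp(
        nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
    )
    print(model)
```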