Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Authors: Hadi Pouransari, Chun-Liang Li, Jen-Hao Chang, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Oncel Tuzel
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach. Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks, reaching target accuracy with up to 6× faster training compared to the baseline. |
| Researcher Affiliation | Industry | Hadi Pouransari¹, Chun-Liang Li¹, Jen-Hao Rick Chang¹, Pavan Kumar Anasosalu Vasu¹, Cem Koc¹, Vaishaal Shankar², Oncel Tuzel¹ (¹Apple, ²Anthropic) |
| Pseudocode | Yes | Algorithm 1: Length-based sampling and curriculum (a hedged Python sketch of this scheme is given below the table) |
| Open Source Code | Yes | Corresponding author: mpouransari@apple.com. Work done while at Apple. *Code to be available at https://github.com/apple/ml-dataset-decomposition. |
| Open Datasets | Yes | For all experiments, except the results in Section 3.5, we use RefinedWeb [46] filtering of Common Crawl [1] with a total of 525 billion tokens using the EleutherAI/gpt-neox [9] tokenizer (vocabulary size is 50,432). |
| Dataset Splits | No | The paper reports accuracy on standard evaluation benchmarks but does not explicitly state split percentages or sample counts for training, validation, and test sets. |
| Hardware Specification | Yes | Software and hardware details: All experiments in this paper are conducted using the OpenLM repository, which is based on PyTorch. We use Fully Sharded Data Parallelism (FSDP) with bfloat16 mixed precision for all experiments. We use the xFormers [29] implementation for attention. For hardware, we use one or more nodes of 8 NVIDIA H100 GPUs (Hopper architecture), each with 80GB memory, and 192 CPU cores with 2000GB of RAM. Nodes are connected through Elastic Fabric Adapter (EFA) for efficient inter-node communication, hosted on AWS. (A minimal FSDP/bf16 setup sketch is given below the table.) |
| Software Dependencies | No | The paper mentions PyTorch, FSDP, and xFormers [29], but does not provide specific version numbers for these software dependencies, which would be necessary for full reproducibility. |
| Experiment Setup | Yes | Baseline hyperparameters: We list our baseline hyperparameters in Table 12 and iterate over changes for each section next. Note that we did not explicitly optimize hyperparameters for any of the experiments, and we always use the same hyperparameters when using either the baseline method or ours. |
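
The paper's Algorithm 1 (length-based sampling and curriculum) is referenced in the table but not reproduced here. The sketch below is a minimal Python illustration of the general idea described in the paper: documents are decomposed into power-of-two-length chunks, chunks are grouped into buckets of equal sequence length, and each optimization step draws an entire batch from one bucket so the token budget per step stays constant. The function names (`decompose`, `build_buckets`, `length_curriculum_batches`), the linear short-to-long weighting, and the default constants are illustrative assumptions, not the paper's exact Algorithm 1 or curriculum schedule.

```python
import random
from collections import defaultdict

def decompose(document_tokens, max_pow=13):
    """Split one tokenized document into chunks whose lengths are powers of two.

    The document is cut greedily from the largest allowed power of two downward,
    so every chunk lands in a bucket of sequences with identical length
    (up to 2**max_pow = 8192 tokens here).
    """
    chunks, start = [], 0
    remaining = len(document_tokens)
    while remaining > 0:
        size = 1 << min(max_pow, remaining.bit_length() - 1)  # largest power of two <= remaining
        chunks.append(document_tokens[start:start + size])
        start += size
        remaining -= size
    return chunks

def build_buckets(corpus, max_pow=13):
    """Group all chunks by log2(length): buckets[i] holds sequences of length 2**i."""
    buckets = defaultdict(list)
    for doc in corpus:
        for chunk in decompose(doc, max_pow):
            buckets[len(chunk).bit_length() - 1].append(chunk)
    return buckets

def length_curriculum_batches(buckets, tokens_per_batch=2**14, num_steps=1000):
    """Yield fixed-token-budget batches, shifting probability mass from short to
    long sequence lengths as training progresses (a simple linear curriculum)."""
    levels = sorted(buckets)
    for step in range(num_steps):
        progress = step / max(1, num_steps - 1)
        weights = [(1 - progress) * (levels[-1] - i + 1) + progress * (i + 1) for i in levels]
        level = random.choices(levels, weights=weights, k=1)[0]
        seq_len = 1 << level
        batch_size = tokens_per_batch // seq_len  # constant token count per optimization step
        # Sampling with replacement for brevity; a real loader would iterate without replacement.
        yield [random.choice(buckets[level]) for _ in range(batch_size)]
```

Because every batch contains sequences of a single length, padding is unnecessary and the per-step token count is identical across buckets, which is what allows longer-context batches at no extra cost relative to the baseline.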
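The hardware row mentions FSDP with bfloat16 mixed precision but gives no configuration. The snippet below is a minimal sketch of how such a setup can be expressed with PyTorch's public FSDP API; the helper name `wrap_with_bf16_fsdp` and the placeholder model are assumptions for illustration, not the OpenLM training code used by the authors.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def wrap_with_bf16_fsdp(model: nn.Module) -> FSDP:
    """Shard a model across ranks with a bfloat16 mixed-precision policy.

    Assumes the process was launched with torchrun, which sets the
    RANK / WORLD_SIZE / MASTER_ADDR environment variables used by NCCL.
    """
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    bf16 = MixedPrecision(
        param_dtype=torch.bfloat16,   # keep parameters in bf16 during compute
        reduce_dtype=torch.bfloat16,  # all-reduce gradients in bf16
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(model.cuda(), mixed_precision=bf16)

if __name__ == "__main__":
    # Placeholder model for illustration; the paper trains OpenLM transformers instead.
    model = wrap_with_bf16_fsdp(
        nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
    )
    print(model)
```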