Blockwise Parallel Transformers for Large Context Models

Authors: Hao Liu, Pieter Abbeel

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance.
Researcher Affiliation | Academia | Hao Liu, UC Berkeley (hao.liu@cs.berkeley.edu); Pieter Abbeel, UC Berkeley (pabbeel@cs.berkeley.edu)
Pseudocode | Yes | Algorithm 1 provides the pseudocode of the algorithm. (A hedged blockwise-computation sketch appears after the table.)
Open Source Code | Yes | The full code of BPT is provided on GitHub (https://github.com/lhao499/llm_large_context), which supports large-scale distributed training of large context models using BPT.
Open Datasets | Yes | We consider two datasets for evaluation purposes, including pretraining on the OpenWebText dataset and large context reinforcement learning on ExORL. ... OpenWebText. The OpenWebText dataset [18] ... URL http://Skylion007.github.io/OpenWebTextCorpus. ... ExORL. The ExORL [58] dataset is based on unlabeled exploratory data collected by running unsupervised RL algorithms.
Dataset Splits | No | The paper refers to training and testing but does not explicitly provide specific train/validation/test dataset splits (percentages or counts), nor does it reference predefined splits with clear citations for data partitioning.
Hardware Specification | Yes | The experiments are on NVIDIA 80GB A100 GPUs; we consider both a single-GPU setting for smaller model training and an 8-GPU setting for model-parallel training. We also experiment with scaling up the model on 64 TPUv4.
Software Dependencies | No | The paper mentions 'Jax documentation' and 'FSDP [16]' but does not provide specific version numbers for these software components or any other key libraries/languages.
Experiment Setup | Yes | We tune the block size for both the baselines and BPT, and report the best results achieved by each. ... For gradient checkpointing [8], we additionally grid search among three commonly used checkpointing policies... The training was conducted using FSDP [16] and gradient accumulation. ... For sequence lengths of 2048, 4096, 8192, 16384, the batch sizes in trajectories were set as 8, 4, 2, 1, 1 respectively. ... The specific hyperparameters are provided in Table 6. (A hedged checkpointing-policy sketch appears after the table.)
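
The paper's Algorithm 1 computes attention and the feedforward network blockwise so that full-length activations are never materialized. As a rough illustration only, the JAX sketch below blocks just the query dimension of standard attention; it is not the authors' implementation (BPT additionally blocks keys/values and fuses the feedforward computation), and the function name, shapes, and block size are illustrative assumptions.

```python
import jax
import jax.numpy as jnp


def blockwise_attention(q, k, v, block_size):
    """Attention computed one query block at a time.

    q, k, v: [seq_len, head_dim]. The full [seq_len, seq_len] score matrix is
    never materialized; peak activation size is [block_size, seq_len].
    """
    seq_len, head_dim = q.shape
    scale = 1.0 / jnp.sqrt(head_dim)

    def attend_block(q_block):
        # Scores for one query block against all keys: [block_size, seq_len].
        scores = (q_block @ k.T) * scale
        weights = jax.nn.softmax(scores, axis=-1)
        return weights @ v  # [block_size, head_dim]

    q_blocks = q.reshape(seq_len // block_size, block_size, head_dim)
    # lax.map walks the blocks sequentially, keeping memory at one block's worth.
    out_blocks = jax.lax.map(attend_block, q_blocks)
    return out_blocks.reshape(seq_len, head_dim)


if __name__ == "__main__":
    q, k, v = jax.random.normal(jax.random.PRNGKey(0), (3, 1024, 64))
    print(blockwise_attention(q, k, v, block_size=128).shape)  # (1024, 64)
```

The block size here plays the same memory/throughput trade-off role that the paper tunes for both BPT and the baselines.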
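
The experiment-setup evidence mentions grid searching among three commonly used gradient checkpointing policies, but the excerpt does not name them. The sketch below only shows how such a search could be wired up with jax.checkpoint and a few standard jax.checkpoint_policies entries; the chosen policies, toy model, and shapes are assumptions, and FSDP and gradient accumulation are omitted.

```python
import jax
import jax.numpy as jnp


def mlp_block(params, x):
    # Toy two-layer feedforward block standing in for a transformer layer (assumption).
    w1, w2 = params
    return jnp.tanh(x @ w1) @ w2


def loss_fn(params, x, policy):
    # jax.checkpoint (a.k.a. jax.remat) recomputes activations in the backward
    # pass according to `policy`, trading extra compute for lower memory.
    block = jax.checkpoint(mlp_block, policy=policy)
    return jnp.mean(block(params, x) ** 2)


if __name__ == "__main__":
    k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
    params = (jax.random.normal(k1, (64, 256)), jax.random.normal(k2, (256, 64)))
    x = jax.random.normal(k3, (32, 64))

    # Illustrative policies only; the policy changes memory/compute, not the values.
    policies = {
        "nothing_saveable": jax.checkpoint_policies.nothing_saveable,
        "checkpoint_dots": jax.checkpoint_policies.checkpoint_dots,
        "everything_saveable": jax.checkpoint_policies.everything_saveable,
    }
    for name, policy in policies.items():
        loss, grads = jax.value_and_grad(loss_fn)(params, x, policy)
        print(name, float(loss))
```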