Blockwise Parallel Transformers for Large Context Models
Authors: Hao Liu, Pieter Abbeel
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance. |
| Researcher Affiliation | Academia | Hao Liu, UC Berkeley, hao.liu@cs.berkeley.edu; Pieter Abbeel, UC Berkeley, pabbeel@cs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1 provides the pseudocode of the algorithm; a hedged sketch of the blockwise computation is given after this table. |
| Open Source Code | Yes | The full code of BPT is provided at GitHub (https://github.com/lhao499/llm_large_context), which supports large-scale distributed training of large context models using BPT. |
| Open Datasets | Yes | We consider two datasets for evaluation purposes, including pretraining on the OpenWebText dataset and large context reinforcement learning on ExoRL. ... OpenWebText. The OpenWebText dataset [18] ... URL http://Skylion007.github.io/OpenWebTextCorpus. ... ExoRL. The ExoRL [58] dataset is based on unlabeled exploratory data collected by running unsupervised RL algorithms. |
| Dataset Splits | No | The paper refers to training and testing but does not explicitly provide specific train/validation/test dataset splits (percentages or counts), nor does it reference predefined splits with clear citations for data partitioning. |
| Hardware Specification | Yes | The experiments are run on NVIDIA 80GB A100 GPUs; we consider both a single-GPU setting for smaller-model training and an 8-GPU setting for model-parallel training. We also experiment with scaling up the model on 64 TPUv4. |
| Software Dependencies | No | The paper mentions 'Jax documentation' and 'FSDP [16]' but does not provide specific version numbers for these software components or any other key libraries/languages. |
| Experiment Setup | Yes | We tune the block size for both the baselines and BPT, and report the best results achieved by each. ... For gradient checkpointing [8], we additionally grid search among three commonly used checkpointing policies... The training was conducted using FSDP [16] and gradient accumulation. ... For sequence lengths of 2048, 4096, 8192, 16384, the batch sizes in trajectories were set as 8, 4, 2, 1, 1 respectively. ... The specific hyperparameters are provided in Table 6. An illustrative sketch of the checkpointing-policy sweep also follows this table. |
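
The pseudocode row above points to Algorithm 1; the following is a minimal JAX sketch of the core idea, not the authors' implementation (which lives in the linked repository). Attention is computed one query block at a time with an online softmax, so the full attention matrix is never materialized, and the feedforward network is applied per query block. The block sizes, the two-layer GELU feedforward, and the residual placement are illustrative assumptions.

```python
import jax
import jax.numpy as jnp


def blockwise_attention_ffn(q, k, v, w1, w2, q_block=256, kv_block=256):
    """Blockwise attention with an online softmax, followed by a per-block
    feedforward. q, k, v have shape (seq_len, d); w1 is (d, hidden) and
    w2 is (hidden, d)."""
    seq_len, d = q.shape
    scale = 1.0 / jnp.sqrt(d)
    outputs = []
    for qs in range(0, seq_len, q_block):
        q_blk = q[qs:qs + q_block] * scale
        acc = jnp.zeros((q_blk.shape[0], d))                # running weighted sum of values
        row_sum = jnp.zeros((q_blk.shape[0], 1))            # running softmax denominator
        row_max = jnp.full((q_blk.shape[0], 1), -jnp.inf)   # running max for stability
        for ks in range(0, seq_len, kv_block):
            k_blk = k[ks:ks + kv_block]
            v_blk = v[ks:ks + kv_block]
            scores = q_blk @ k_blk.T                        # only a (q_block, kv_block) tile in memory
            blk_max = jnp.max(scores, axis=-1, keepdims=True)
            new_max = jnp.maximum(row_max, blk_max)
            correction = jnp.exp(row_max - new_max)         # rescale contributions of earlier blocks
            p = jnp.exp(scores - new_max)
            acc = acc * correction + p @ v_blk
            row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
            row_max = new_max
        attn_out = acc / row_sum
        # The feedforward is applied to one query block at a time, so its
        # activations never exist for the whole sequence at once.
        ffn_out = jax.nn.gelu(attn_out @ w1) @ w2
        outputs.append(attn_out + ffn_out)                  # residual around the FFN
    return jnp.concatenate(outputs, axis=0)
```

Calling this with, say, (1024, 64) query/key/value arrays and a hidden size of 256 gives the same attention-plus-feedforward output as the naive computation while never allocating a 1024x1024 score matrix.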
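
The experiment-setup row mentions a grid search over three commonly used gradient-checkpointing policies. The snippet below is a rough illustration, assuming JAX's rematerialization API, of how such a sweep could be expressed; the toy `transformer_block`, the particular policy list, and the loss are hypothetical and are not the paper's training code.

```python
import jax
import jax.numpy as jnp


def transformer_block(x, w):
    # Toy stand-in for a transformer block (illustration only).
    return jax.nn.gelu(x @ w) @ w.T


def loss_fn(x, w, policy):
    # jax.checkpoint (a.k.a. jax.remat) controls which intermediates are
    # saved for the backward pass; everything else is recomputed.
    block = jax.checkpoint(transformer_block, policy=policy)
    return jnp.mean(block(x, w) ** 2)


# Three commonly used rematerialization policies (assumed choices).
policies = {
    "nothing_saveable": jax.checkpoint_policies.nothing_saveable,
    "dots_saveable": jax.checkpoint_policies.dots_saveable,
    "dots_with_no_batch_dims_saveable":
        jax.checkpoint_policies.dots_with_no_batch_dims_saveable,
}

x = jnp.ones((8, 16))
w = jnp.ones((16, 16))
for name, policy in policies.items():
    # In a real sweep one would record peak memory and step time per policy.
    grads = jax.grad(loss_fn, argnums=1)(x, w, policy)
    print(name, grads.shape)
```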