Ring Attention with Blockwise Transformers for Near-Infinite Context
Authors: Hao Liu, Matei Zaharia, Pieter Abbeel
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of our approach in allowing millions of tokens context size and improving performance. |
| Researcher Affiliation | Academia | Hao Liu, Matei Zaharia, Pieter Abbeel (UC Berkeley) |
| Pseudocode | Yes | Algorithm 1 Large Context Transformers using Ring Attention with Blockwise Transformers. |
| Open Source Code | Yes | We provide the complete code on github. |
| Open Datasets | Yes | We use user-shared conversations gathered from ShareGPT.com with its public APIs for finetuning, following methodologies as outlined in prior works (Chiang et al., 2023; Geng et al., 2023). |
| Dataset Splits | No | The paper describes training and test procedures (e.g., 'Training Configuration', 'Evaluating Max Context Size'), but it does not explicitly specify validation dataset splits or methodology, or mention a 'validation set' in the context of data partitioning. |
| Hardware Specification | Yes | For GPUs, we consider both single DGX A100 server with 8 GPUs and distributed 32 A100 GPUs. We also experiment with TPUs, from older generations TPUv3 to newer generations of TPUv4 and TPUv5e. |
| Software Dependencies | No | The paper mentions using 'Jax' for its implementation ('A Jax implementation is provided in Appendix A.') and 'jax.lax.ppermute' for operations, but it does not specify version numbers for Jax or any other software dependencies. (A sketch of the ring communication pattern built on jax.lax.ppermute is shown below the table.) |
| Experiment Setup | Yes | Model Configuration. Our study is built upon the LLaMA architecture, we consider 3B, 7B, 13B, and 30B model sizes in our experiments. Training Configuration. For all methods, we apply full gradient checkpointing (Chen et al., 2016) to both attention and feedforward, following prior works (Rabe and Staats, 2021; Liu and Abbeel, 2023b). The batch size in tokens are 2M on 8/32x A100 and 4M on TPUv4-256. |
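
The Software Dependencies row notes that the paper's implementation is in JAX and uses jax.lax.ppermute for ring communication. The snippet below is a minimal sketch of that pattern, not the authors' released code: each device keeps its query block, computes blockwise attention against whichever key/value block it currently holds, and rotates the key/value blocks to the next device with jax.lax.ppermute while accumulating results with an online (log-sum-exp style) softmax. It omits causal masking, multiple heads, the feedforward path, and the communication/computation overlap described in the paper; all shapes and block sizes are illustrative.

```python
import functools
import jax
import jax.numpy as jnp

def ring_attention_block(q, k, v, *, axis_name, axis_size):
    """Per-device blocks q, k, v of shape [block_len, head_dim] (illustrative sketch)."""
    scale = q.shape[-1] ** -0.5
    # Each step sends this device's key/value block to the next device in the ring.
    perm = [(i, (i + 1) % axis_size) for i in range(axis_size)]

    def body(carry, _):
        out, row_max, row_sum, k, v = carry
        scores = (q @ k.T) * scale                        # [block_len, block_len]
        new_max = jnp.maximum(row_max, scores.max(axis=-1))
        corr = jnp.exp(row_max - new_max)                 # rescale previous accumulators
        p = jnp.exp(scores - new_max[:, None])
        out = out * corr[:, None] + p @ v
        row_sum = row_sum * corr + p.sum(axis=-1)
        k = jax.lax.ppermute(k, axis_name, perm)          # rotate keys around the ring
        v = jax.lax.ppermute(v, axis_name, perm)          # rotate values around the ring
        return (out, new_max, row_sum, k, v), None

    init = (jnp.zeros_like(q),
            jnp.full((q.shape[0],), -jnp.inf, dtype=q.dtype),  # running row max
            jnp.zeros((q.shape[0],), dtype=q.dtype),           # running softmax denominator
            k, v)
    (out, _, row_sum, _, _), _ = jax.lax.scan(body, init, None, length=axis_size)
    return out / row_sum[:, None]

# Toy usage: shard one attention head's sequence across all local devices.
n_dev = jax.local_device_count()
block_len, head_dim = 128, 64
q, k, v = jax.random.normal(jax.random.PRNGKey(0), (3, n_dev, block_len, head_dim))
attn = jax.pmap(
    functools.partial(ring_attention_block, axis_name="ring", axis_size=n_dev),
    axis_name="ring",
)
out = attn(q, k, v)   # [n_dev, block_len, head_dim]
```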
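
The Experiment Setup row quotes the use of full gradient checkpointing on both attention and feedforward. The sketch below is an assumed illustration of that configuration using jax.checkpoint (also known as jax.remat); it is not the paper's training code, and the layer widths, initialization, and loss are placeholders.

```python
import jax
import jax.numpy as jnp

@jax.checkpoint  # recompute attention activations during the backward pass
def attention_block(x, w_qkv, w_out):
    q, k, v = jnp.split(x @ w_qkv, 3, axis=-1)
    scores = jax.nn.softmax((q @ k.T) / jnp.sqrt(q.shape[-1]), axis=-1)
    return (scores @ v) @ w_out

@jax.checkpoint  # recompute feedforward activations during the backward pass
def feedforward_block(x, w1, w2):
    return jax.nn.gelu(x @ w1) @ w2

def transformer_layer(x, params):
    x = x + attention_block(x, params["w_qkv"], params["w_out"])
    return x + feedforward_block(x, params["w1"], params["w2"])

# Toy usage: the backward pass is where checkpointing trades compute for memory.
d = 64
key = jax.random.PRNGKey(0)
params = {
    "w_qkv": jax.random.normal(key, (d, 3 * d)) * 0.02,
    "w_out": jax.random.normal(key, (d, d)) * 0.02,
    "w1":    jax.random.normal(key, (d, 4 * d)) * 0.02,
    "w2":    jax.random.normal(key, (4 * d, d)) * 0.02,
}
x = jax.random.normal(key, (128, d))
loss, grads = jax.value_and_grad(lambda p: transformer_layer(x, p).sum())(params)
```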