Ring Attention with Blockwise Transformers for Near-Infinite Context

Authors: Hao Liu, Matei Zaharia, Pieter Abbeel

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of our approach in allowing millions of tokens context size and improving performance. |
| Researcher Affiliation | Academia | Hao Liu, Matei Zaharia, Pieter Abbeel (UC Berkeley) |
| Pseudocode | Yes | Algorithm 1: Large Context Transformers using Ring Attention with Blockwise Transformers. |
| Open Source Code | Yes | We provide the complete code on GitHub. |
| Open Datasets | Yes | We use user-shared conversations gathered from ShareGPT.com with its public APIs for finetuning, following methodologies as outlined in prior works (Chiang et al., 2023; Geng et al., 2023). |
| Dataset Splits | No | The paper describes training and test procedures (e.g., 'Training Configuration', 'Evaluating Max Context Size'), but it does not explicitly specify validation dataset splits or methodology, or mention a 'validation set' in the context of data partitioning. |
| Hardware Specification | Yes | For GPUs, we consider both single DGX A100 server with 8 GPUs and distributed 32 A100 GPUs. We also experiment with TPUs, from older generations TPUv3 to newer generations of TPUv4 and TPUv5e. |
| Software Dependencies | No | The paper mentions using 'Jax' for its implementation ('A Jax implementation is provided in Appendix A.') and 'jax.lax.ppermute' for its collective operations, but it does not specify version numbers for Jax or any other software dependencies. |
| Experiment Setup | Yes | Model Configuration. Our study is built upon the LLaMA architecture; we consider 3B, 7B, 13B, and 30B model sizes in our experiments. Training Configuration. For all methods, we apply full gradient checkpointing (Chen et al., 2016) to both attention and feedforward, following prior works (Rabe and Staats, 2021; Liu and Abbeel, 2023b). The batch size in tokens is 2M on 8/32x A100 and 4M on TPUv4-256. |
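
The Pseudocode and Software Dependencies rows above point to Algorithm 1 and to jax.lax.ppermute, the collective used to pass blocks between devices. As a rough, hypothetical sketch of that ring communication pattern (not the authors' released implementation), the snippet below rotates key/value blocks around the device ring with jax.lax.ppermute while each device attends over its local query block; the paper's numerically stable blockwise softmax and causal masking are omitted, and all function names and shapes are placeholders.

```python
import jax
import jax.numpy as jnp


def make_ring_attention(n_dev, axis_name="ring"):
    """Build a per-device function: attend the local query block against all
    key/value blocks, rotating K/V one hop around the ring per step."""
    # Each device i sends its current K/V block to device (i + 1) % n_dev.
    perm = [(i, (i + 1) % n_dev) for i in range(n_dev)]

    def ring_attention_block(q, k, v):
        def step(carry, _):
            k_blk, v_blk, num, den = carry
            # Attention of the local query block against the current K/V block.
            scores = q @ k_blk.T / jnp.sqrt(jnp.float32(q.shape[-1]))
            weights = jnp.exp(scores)  # stable running-max softmax omitted for brevity
            num = num + weights @ v_blk
            den = den + weights.sum(axis=-1, keepdims=True)
            # Pass the K/V block to the next device in the ring.
            k_blk = jax.lax.ppermute(k_blk, axis_name, perm)
            v_blk = jax.lax.ppermute(v_blk, axis_name, perm)
            return (k_blk, v_blk, num, den), None

        init = (k, v, jnp.zeros_like(q), jnp.zeros((q.shape[0], 1)))
        (_, _, num, den), _ = jax.lax.scan(step, init, None, length=n_dev)
        return num / den

    return ring_attention_block


# Usage: one (block, head_dim) slice of the sequence per device.
n_dev, blk, dim = jax.device_count(), 128, 64
q = jnp.ones((n_dev, blk, dim))
k = jnp.ones((n_dev, blk, dim))
v = jnp.ones((n_dev, blk, dim))
out = jax.pmap(make_ring_attention(n_dev), axis_name="ring")(q, k, v)
print(out.shape)  # (n_dev, blk, dim)
```

In the paper, this block rotation is overlapped with the blockwise attention and feedforward computation, which is what allows the supported context length to grow with the number of devices.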
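
The Experiment Setup row also notes full gradient checkpointing applied to both attention and feedforward. Below is a minimal sketch of how that is typically expressed in JAX via jax.checkpoint (remat), assuming toy placeholder sub-blocks rather than the paper's model code.

```python
import jax
import jax.numpy as jnp


def toy_attention(w, x):
    # Toy single-head self-attention, a stand-in for the real sub-block.
    q, k, v = x @ w["q"], x @ w["k"], x @ w["v"]
    a = jax.nn.softmax(q @ k.T / jnp.sqrt(jnp.float32(q.shape[-1])))
    return a @ v


def toy_feedforward(w, x):
    return jax.nn.gelu(x @ w["in"]) @ w["out"]


@jax.checkpoint  # recompute this block's activations during the backward pass instead of storing them
def transformer_block(params, x):
    x = x + toy_attention(params["attn"], x)
    x = x + toy_feedforward(params["mlp"], x)
    return x


# Example: gradients flow through the checkpointed block as usual.
key = jax.random.PRNGKey(0)
dim, seq = 16, 8
params = {
    "attn": {n: jax.random.normal(key, (dim, dim)) for n in ("q", "k", "v")},
    "mlp": {"in": jax.random.normal(key, (dim, 4 * dim)),
            "out": jax.random.normal(key, (4 * dim, dim))},
}
x = jax.random.normal(key, (seq, dim))
grads = jax.grad(lambda p, x: transformer_block(p, x).sum())(params, x)
```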