Ring Attention with Blockwise Transformers for Near-Infinite Context
Authors: Hao Liu, Matei Zaharia, Pieter Abbeel
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of our approach in allowing millions of tokens context size and improving performance. |
| Researcher Affiliation | Academia | Hao Liu, Matei Zaharia, Pieter Abbeel (UC Berkeley) |
| Pseudocode | Yes | Algorithm 1 Large Context Transformers using Ring Attention with Blockwise Transformers. |
| Open Source Code | Yes | We provide the complete code on github. |
| Open Datasets | Yes | We use user-shared conversations gathered from ShareGPT.com with its public APIs for finetuning, following methodologies as outlined in prior works (Chiang et al., 2023; Geng et al., 2023). |
| Dataset Splits | No | The paper describes training and test procedures (e.g., 'Training Configuration', 'Evaluating Max Context Size'), but it does not explicitly specify validation dataset splits or methodology, or mention a 'validation set' in the context of data partitioning. |
| Hardware Specification | Yes | For GPUs, we consider both single DGX A100 server with 8 GPUs and distributed 32 A100 GPUs. We also experiment with TPUs, from older generations TPUv3 to newer generations of TPUv4 and TPUv5e. |
| Software Dependencies | No | The paper mentions using 'Jax' for its implementation ('A Jax implementation is provided in Appendix A.') and 'jax.lax.ppermute' for operations, but it does not specify version numbers for Jax or any other software dependencies. (A sketch of the ring communication pattern built on jax.lax.ppermute is shown below the table.) |
| Experiment Setup | Yes | Model Configuration. Our study is built upon the LLaMA architecture, we consider 3B, 7B, 13B, and 30B model sizes in our experiments. Training Configuration. For all methods, we apply full gradient checkpointing (Chen et al., 2016) to both attention and feedforward, following prior works (Rabe and Staats, 2021; Liu and Abbeel, 2023b). The batch size in tokens are 2M on 8/32x A100 and 4M on TPUv4-256. |
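
The Software Dependencies row notes that the paper's implementation is in JAX and uses jax.lax.ppermute for ring communication. The snippet below is a minimal sketch of that pattern, not the authors' released code: each device keeps its query block, computes blockwise attention against whichever key/value block it currently holds, and rotates the key/value blocks to the next device with jax.lax.ppermute while accumulating results with an online (log-sum-exp style) softmax. It omits causal masking, multiple heads, the feedforward path, and the communication/computation overlap described in the paper; all shapes and block sizes are illustrative.

```python
import functools
import jax
import jax.numpy as jnp

def ring_attention_block(q, k, v, *, axis_name, axis_size):
    """Per-device blocks q, k, v of shape [block_len, head_dim] (illustrative sketch)."""
    scale = q.shape[-1] ** -0.5
    # Each step sends this device's key/value block to the next device in the ring.
    perm = [(i, (i + 1) % axis_size) for i in range(axis_size)]

    def body(carry, _):
        out, row_max, row_sum, k, v = carry
        scores = (q @ k.T) * scale                        # [block_len, block_len]
        new_max = jnp.maximum(row_max, scores.max(axis=-1))
        corr = jnp.exp(row_max - new_max)                 # rescale previous accumulators
        p = jnp.exp(scores - new_max[:, None])
        out = out * corr[:, None] + p @ v
        row_sum = row_sum * corr + p.sum(axis=-1)
        k = jax.lax.ppermute(k, axis_name, perm)          # rotate keys around the ring
        v = jax.lax.ppermute(v, axis_name, perm)          # rotate values around the ring
        return (out, new_max, row_sum, k, v), None

    init = (jnp.zeros_like(q),
            jnp.full((q.shape[0],), -jnp.inf, dtype=q.dtype),  # running row max
            jnp.zeros((q.shape[0],), dtype=q.dtype),           # running softmax denominator
            k, v)
    (out, _, row_sum, _, _), _ = jax.lax.scan(body, init, None, length=axis_size)
    return out / row_sum[:, None]

# Toy usage: shard one attention head's sequence across all local devices.
n_dev = jax.local_device_count()
block_len, head_dim = 128, 64
q, k, v = jax.random.normal(jax.random.PRNGKey(0), (3, n_dev, block_len, head_dim))
attn = jax.pmap(
    functools.partial(ring_attention_block, axis_name="ring", axis_size=n_dev),
    axis_name="ring",
)
out = attn(q, k, v)   # [n_dev, block_len, head_dim]
```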
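
The Experiment Setup row quotes the use of full gradient checkpointing on both attention and feedforward. The sketch below is an assumed illustration of that configuration using jax.checkpoint (also known as jax.remat); it is not the paper's training code, and the layer widths, initialization, and loss are placeholders.

```python
import jax
import jax.numpy as jnp

@jax.checkpoint  # recompute attention activations during the backward pass
def attention_block(x, w_qkv, w_out):
    q, k, v = jnp.split(x @ w_qkv, 3, axis=-1)
    scores = jax.nn.softmax((q @ k.T) / jnp.sqrt(q.shape[-1]), axis=-1)
    return (scores @ v) @ w_out

@jax.checkpoint  # recompute feedforward activations during the backward pass
def feedforward_block(x, w1, w2):
    return jax.nn.gelu(x @ w1) @ w2

def transformer_layer(x, params):
    x = x + attention_block(x, params["w_qkv"], params["w_out"])
    return x + feedforward_block(x, params["w1"], params["w2"])

# Toy usage: the backward pass is where checkpointing trades compute for memory.
d = 64
key = jax.random.PRNGKey(0)
params = {
    "w_qkv": jax.random.normal(key, (d, 3 * d)) * 0.02,
    "w_out": jax.random.normal(key, (d, d)) * 0.02,
    "w1":    jax.random.normal(key, (d, 4 * d)) * 0.02,
    "w2":    jax.random.normal(key, (4 * d, d)) * 0.02,
}
x = jax.random.normal(key, (128, d))
loss, grads = jax.value_and_grad(lambda p: transformer_layer(x, p).sum())(params)
```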