Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference

Authors: Rohan Baskar Prabhakar, Hengrui Zhang, David Wentzlaff

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the improvements Kraken offers over standard Transformers in two key aspects: model quality and inference latency. For the former, we train a series of Kraken models with varying degrees of parallelism and parameter count on OpenWebText (23) and compare them with the GPT-2 (44) family of models on the SuperGLUE suite of benchmarks (53). We then implement Kraken using the TensorRT-LLM library (15) and measure the Time To First Token (TTFT) given various model sizes and context lengths to illustrate the efficiency gains when collective operations are no longer on the critical path. (An illustrative TTFT timing sketch appears after this table.)
Researcher Affiliation | Academia | Rohan Baskar Prabhakar, Princeton University, rohanbp@princeton.edu; Hengrui Zhang, Princeton University, hengrui.zhang@princeton.edu; David Wentzlaff, Princeton University, wentzlaf@princeton.edu
Pseudocode | Yes | Algorithm 1: Kraken Sub-Layer: Forward Pass (an illustrative, non-authoritative sketch of the general idea appears after this table)
Open Source Code | Yes | Pertinent code including the TensorRT-LLM implementation is available at https://github.com/rohan-bp/kraken.
Open Datasets | Yes | To evaluate language modeling performance, we train a series of models of up to 761 million parameters with varying degrees of parallelism on OpenWebText (23).
Dataset Splits | No | The paper reports 'Validation Perplexity' in Table 1, implying the use of a validation set, but it does not provide specific details on the dataset splits (e.g., exact percentages or sample counts for training, validation, and test sets).
Hardware Specification | Yes | All experiments were conducted on an 8 x A100 GPU machine with NVSwitch and 40 GB of HBM memory per GPU.
Software Dependencies | Yes | We used TensorRT-LLM version 0.12.0.dev2024073000 throughout the evaluation.
Experiment Setup | Yes | For all pretrained models presented in Section 4.1, we used a similarly sized GPT-3 (7) model's hyperparameters as the basis and followed the procedure outlined in Section 3.2 to calculate the embedding dimension. We did not make an effort to optimize the codebase used for training, which builds off of nanoGPT (29). It is possible to replicate pretrained models by extending nanoGPT to implement the new forward pass as described in Algorithm 1. The Adam optimizer was used to train all models along with a cosine learning rate decay with linear warmup (an illustrative schedule sketch appears after this table). Initial learning rates and the approximate GPU hours required to train each configuration are presented in Table 5. All models were trained for 300,000 gradient steps.
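
The TTFT measurements referenced in the Research Type row were taken with TensorRT-LLM; as a minimal, implementation-agnostic sketch of how Time To First Token can be timed, the Python below measures the latency of prefilling a prompt and emitting the first output token. The model object and its prefill_and_sample_first_token method are hypothetical placeholders, not the paper's benchmarking harness.

    import statistics
    import time

    def measure_ttft(model, prompt_token_ids, warmup=3, iters=20):
        # Warm-up runs exclude one-time costs (graph capture, memory allocation).
        # `model.prefill_and_sample_first_token` is a hypothetical placeholder.
        for _ in range(warmup):
            model.prefill_and_sample_first_token(prompt_token_ids)
        samples = []
        for _ in range(iters):
            start = time.perf_counter()
            # Prefill over the full prompt and emit the first output token.
            model.prefill_and_sample_first_token(prompt_token_ids)
            samples.append(time.perf_counter() - start)
        # Median is robust to occasional scheduling jitter.
        return statistics.median(samples)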
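
The report reproduces only the title of Algorithm 1, not its body. As a heavily hedged illustration of the general idea suggested by the title (each device runs its own sub-layer, so no collective operation sits on the layer's critical path), the PyTorch sketch below runs k independent attention-plus-MLP sub-layers over k separate streams. It is an assumption made for illustration only and should not be read as the paper's actual forward pass.

    import torch
    import torch.nn as nn

    class SubLayerSketch(nn.Module):
        # One of k parallel sub-layers: its own attention and MLP over its own
        # stream. Illustrative stand-in only, NOT the paper's Algorithm 1.
        def __init__(self, d_model, n_heads):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x):
            h = self.ln1(x)
            a, _ = self.attn(h, h, h, need_weights=False)
            x = x + a
            return x + self.mlp(self.ln2(x))

    class ParallelLayerSketch(nn.Module):
        # Holds k independent sub-layers; each processes its own hidden stream,
        # so no cross-device communication is required inside the layer.
        def __init__(self, d_model, n_heads, k):
            super().__init__()
            self.sublayers = nn.ModuleList(
                SubLayerSketch(d_model, n_heads) for _ in range(k)
            )

        def forward(self, streams):
            # streams: list of k tensors, one per (logical) device.
            return [sub(x) for sub, x in zip(self.sublayers, streams)]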
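
The Experiment Setup row mentions Adam with a cosine learning-rate decay and linear warmup over 300,000 gradient steps; a minimal sketch of such a schedule follows. The warmup length and the maximum/minimum learning rates used here are placeholder assumptions; the actual initial learning rates are given in the paper's Table 5, which is not reproduced in this summary.

    import math

    def lr_at_step(step, max_lr, min_lr, warmup_steps, total_steps):
        # Linear warmup from 0 to max_lr, then cosine decay down to min_lr.
        if step < warmup_steps:
            return max_lr * (step + 1) / warmup_steps
        progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
        return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

    # Example: 300,000 total gradient steps as in the setup; warmup length and
    # learning-rate bounds below are placeholder assumptions, not Table 5 values.
    lrs = [lr_at_step(s, 6e-4, 6e-5, 2_000, 300_000)
           for s in (0, 1_000, 2_000, 150_000, 299_999)]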