Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference

Authors: Rohan Baskar Prabhakar, Hengrui Zhang, David Wentzlaff

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the improvements Kraken offers over standard Transformers in two key aspects: model quality and inference latency. For the former, we train a series of Kraken models with varying degrees of parallelism and parameter count on OpenWebText (23) and compare them with the GPT-2 (44) family of models on the SuperGLUE suite of benchmarks (53). We then implement Kraken using the TensorRT-LLM library (15) and measure the Time To First Token (TTFT) given various model sizes and context lengths to illustrate the efficiency gains when collective operations are no longer on the critical path. (An illustrative TTFT timing sketch appears after this table.)
Researcher Affiliation | Academia | Rohan Baskar Prabhakar, Princeton University, rohanbp@princeton.edu; Hengrui Zhang, Princeton University, hengrui.zhang@princeton.edu; David Wentzlaff, Princeton University, wentzlaf@princeton.edu
Pseudocode | Yes | Algorithm 1: Kraken Sub-Layer: Forward Pass (an illustrative, non-authoritative sketch of the general idea appears after this table)
Open Source Code | Yes | Pertinent code including the TensorRT-LLM implementation is available at https://github.com/rohan-bp/kraken.
Open Datasets | Yes | To evaluate language modeling performance, we train a series of models of up to 761 million parameters with varying degrees of parallelism on OpenWebText (23).
Dataset Splits | No | The paper reports 'Validation Perplexity' in Table 1, implying the use of a validation set, but it does not provide specific details on the dataset splits (e.g., exact percentages or sample counts for training, validation, and test sets).
Hardware Specification | Yes | All experiments were conducted on an 8 x A100 GPU machine with NVSwitch and 40 GB of HBM memory per GPU.
Software Dependencies | Yes | We used TensorRT-LLM version 0.12.0.dev2024073000 throughout the evaluation.
Experiment Setup | Yes | For all pretrained models presented in Section 4.1, we used a similarly sized GPT-3 (7) model's hyperparameters as the basis and followed the procedure outlined in Section 3.2 to calculate the embedding dimension. We did not make an effort to optimize the codebase used for training, which builds off of nanoGPT (29). It is possible to replicate pretrained models by extending nanoGPT to implement the new forward pass as described in Algorithm 1. The Adam optimizer was used to train all models along with a cosine learning rate decay with linear warmup (an illustrative schedule sketch appears after this table). Initial learning rates and the approximate GPU hours required to train each configuration are presented in Table 5. All models were trained for 300,000 gradient steps.
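
The TTFT measurements referenced in the Research Type row were taken with TensorRT-LLM; as a minimal, implementation-agnostic sketch of how Time To First Token can be timed, the Python below measures the latency of prefilling a prompt and emitting the first output token. The model object and its prefill_and_sample_first_token method are hypothetical placeholders, not the paper's benchmarking harness.

    import statistics
    import time

    def measure_ttft(model, prompt_token_ids, warmup=3, iters=20):
        # Warm-up runs exclude one-time costs (graph capture, memory allocation).
        # `model.prefill_and_sample_first_token` is a hypothetical placeholder.
        for _ in range(warmup):
            model.prefill_and_sample_first_token(prompt_token_ids)
        samples = []
        for _ in range(iters):
            start = time.perf_counter()
            # Prefill over the full prompt and emit the first output token.
            model.prefill_and_sample_first_token(prompt_token_ids)
            samples.append(time.perf_counter() - start)
        # Median is robust to occasional scheduling jitter.
        return statistics.median(samples)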
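
The report reproduces only the title of Algorithm 1, not its body. As a heavily hedged illustration of the general idea suggested by the title (each device runs its own sub-layer, so no collective operation sits on the layer's critical path), the PyTorch sketch below runs k independent attention-plus-MLP sub-layers over k separate streams. It is an assumption made for illustration only and should not be read as the paper's actual forward pass.

    import torch
    import torch.nn as nn

    class SubLayerSketch(nn.Module):
        # One of k parallel sub-layers: its own attention and MLP over its own
        # stream. Illustrative stand-in only, NOT the paper's Algorithm 1.
        def __init__(self, d_model, n_heads):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x):
            h = self.ln1(x)
            a, _ = self.attn(h, h, h, need_weights=False)
            x = x + a
            return x + self.mlp(self.ln2(x))

    class ParallelLayerSketch(nn.Module):
        # Holds k independent sub-layers; each processes its own hidden stream,
        # so no cross-device communication is required inside the layer.
        def __init__(self, d_model, n_heads, k):
            super().__init__()
            self.sublayers = nn.ModuleList(
                SubLayerSketch(d_model, n_heads) for _ in range(k)
            )

        def forward(self, streams):
            # streams: list of k tensors, one per (logical) device.
            return [sub(x) for sub, x in zip(self.sublayers, streams)]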
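
The Experiment Setup row mentions Adam with a cosine learning-rate decay and linear warmup over 300,000 gradient steps; a minimal sketch of such a schedule follows. The warmup length and the maximum/minimum learning rates used here are placeholder assumptions; the actual initial learning rates are given in the paper's Table 5, which is not reproduced in this summary.

    import math

    def lr_at_step(step, max_lr, min_lr, warmup_steps, total_steps):
        # Linear warmup from 0 to max_lr, then cosine decay down to min_lr.
        if step < warmup_steps:
            return max_lr * (step + 1) / warmup_steps
        progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
        return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

    # Example: 300,000 total gradient steps as in the setup; warmup length and
    # learning-rate bounds below are placeholder assumptions, not Table 5 values.
    lrs = [lr_at_step(s, 6e-4, 6e-5, 2_000, 300_000)
           for s in (0, 1_000, 2_000, 150_000, 299_999)]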