Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
Authors: Rohan Baskar Prabhakar, Hengrui Zhang, David Wentzlaff
Venue: NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the improvements Kraken offers over standard Transformers in two key aspects: model quality and inference latency. For the former, we train a series of Kraken models with varying degrees of parallelism and parameter count on OpenWebText (23) and compare them with the GPT-2 (44) family of models on the SuperGLUE suite of benchmarks (53). We then implement Kraken using the TensorRT-LLM library (15) and measure the Time To First Token (TTFT) given various model sizes and context lengths to illustrate the efficiency gains when collective operators are no longer on the critical path. (An illustrative TTFT timing sketch follows the table.) |
| Researcher Affiliation | Academia | Rohan Baskar Prabhakar, Princeton University (rohanbp@princeton.edu); Hengrui Zhang, Princeton University (hengrui.zhang@princeton.edu); David Wentzlaff, Princeton University (wentzlaf@princeton.edu) |
| Pseudocode | Yes | Algorithm 1: Kraken Sub-Layer: Forward Pass |
| Open Source Code | Yes | Pertinent code including the TensorRT-LLM implementation is available at https://github.com/rohan-bp/kraken. |
| Open Datasets | Yes | To evaluate language modeling performance, we train a series of models up to 761 million parameters large and with varying degrees of parallelism on OpenWebText (23). |
| Dataset Splits | No | The paper reports 'Validation Perplexity' in Table 1, implying the use of a validation set, but it does not provide specific details on the dataset splits (e.g., exact percentages or sample counts for training, validation, and test sets). |
| Hardware Specification | Yes | All experiments were conducted on an 8× A100 GPU machine with NVSwitch and 40 GB of HBM memory per GPU. |
| Software Dependencies | Yes | We used TensorRT-LLM version 0.12.0.dev2024073000 throughout the evaluation. |
| Experiment Setup | Yes | For all pretrained models presented in Section 4.1, we used a similarly sized GPT-3 (7) model's hyperparameters as the basis and followed the procedure outlined in Section 3.2 to calculate the embedding dimension. We did not make an effort to optimize the codebase used for training, which builds off of nanoGPT (29). It is possible to replicate pretrained models by extending nanoGPT to implement the new forward pass as described in Algorithm 1. The Adam optimizer was used to train all models along with a cosine learning rate decay with linear warmup. Initial learning rates and the approximate GPU hours required to train each configuration are presented in Table 5. All models were trained for 300,000 gradient steps. (A minimal sketch of such a learning-rate schedule follows the table.) |
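The Research Type row above measures Time To First Token (TTFT). The snippet below is a minimal, illustrative sketch of how TTFT can be timed; `generate_stream` is a hypothetical streaming interface standing in for the engine's actual API (the paper's measurements use TensorRT-LLM), so this is not the authors' benchmarking code.

```python
import time

def measure_ttft(generate_stream, prompt, n_trials=10):
    """Average Time To First Token: latency from submitting a prompt
    until the first generated token arrives.

    `generate_stream` is a hypothetical callable that yields tokens one
    at a time; it stands in for whatever streaming interface the
    inference engine exposes.
    """
    latencies = []
    for _ in range(n_trials):
        start = time.perf_counter()
        stream = generate_stream(prompt)
        next(stream)                               # block until the first token
        latencies.append(time.perf_counter() - start)
        for _ in stream:                           # drain the remaining tokens
            pass
    return sum(latencies) / len(latencies)
```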
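The Experiment Setup row states that training used Adam with a cosine learning-rate decay and linear warmup. Below is a minimal nanoGPT-style sketch of such a schedule; the warmup length and the maximum/minimum learning rates shown are assumed placeholders, since the paper reports its actual initial learning rates in Table 5.

```python
import math

def get_lr(step, max_lr, min_lr, warmup_steps, decay_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps          # linear warmup
    if step > decay_steps:
        return min_lr                                       # floor after the decay window
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))      # goes 1 -> 0 over the decay window
    return min_lr + coeff * (max_lr - min_lr)

# Example for a 300,000-step run (hyperparameter values are assumed,
# not taken from the paper's Table 5):
if __name__ == "__main__":
    for step in (0, 1_000, 150_000, 300_000):
        print(step, get_lr(step, max_lr=6e-4, min_lr=6e-5,
                           warmup_steps=2_000, decay_steps=300_000))
```

In a training loop, the returned value would be written into the optimizer's parameter groups before each gradient step.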