Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Learning to Compress Prompts with Gist Tokens
Authors: Jesse Mu, Xiang Li, Noah Goodman
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On decoder (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) LMs, gisting enables up to 26x compression of prompts, resulting in up to 40% FLOPs reductions, 4.2% wall time speedups, and storage savings, all with minimal loss in output quality. |
| Researcher Affiliation | Academia | Jesse Mu, Xiang Lisa Li, Noah Goodman Stanford University EMAIL, EMAIL |
| Pseudocode | Yes | Appendix A: Example PyTorch Implementation of Gist Masking |
| Open Source Code | Yes | Code, data, and model checkpoints are available at https://github.com/jayelm/gisting. |
| Open Datasets | Yes | To obtain the largest possible set of tasks for instruction finetuning, we create a dataset called Alpaca+, which combines the Self-Instruct [36] and Stanford Alpaca [31] instruction tuning datasets, each consisting of (t, x, y) tuples sampled from OpenAI's text-davinci-001 and text-davinci-003 variants of GPT-3, respectively. |
| Dataset Splits | Yes | From Alpaca+ we hold out 3 validation splits: 1000 Seen prompts (with unseen, non-empty inputs); 1000 Unseen prompts (with non-empty inputs); and the 252 hand-written Human prompts and completions used in Wang et al. [36], of which 83% have non-empty inputs. |
| Hardware Specification | Yes | Experiments were run on a cluster machine with 4x A100-SXM4-80GB NVIDIA GPUs, 480GB RAM, and 16 CPUs, using PyTorch 2.0 [24], Hugging Face Transformers [41], and DeepSpeed [29]. |
| Software Dependencies | Yes | Experiments were run on a cluster machine with 4x A100-SXM4-80GB NVIDIA GPUs, 480GB RAM, and 16 CPUs, using PyTorch 2.0 [24], Hugging Face Transformers [41], and DeepSpeed [29]. |
| Experiment Setup | Yes | Full hyperparameters for training runs are located in Table A.1. |
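
The Pseudocode row refers to gist masking, which the table does not spell out. The following is a minimal sketch of the general idea for a decoder-only model, not the authors' implementation: positions after the gist tokens are blocked from attending to anything before the gist tokens, so the prompt must be compressed into the gist positions. The function name `make_gist_mask`, the token id `99`, and the assumption that gist tokens form one contiguous run are all illustrative choices, and plain Python lists stand in for tensors.

```python
def make_gist_mask(token_ids, gist_id):
    """Build a boolean attention mask for a decoder-only model.

    mask[i][j] is True when query position i may attend to key position j.
    Assumes (for illustration) that gist tokens form one contiguous run.
    """
    n = len(token_ids)
    gist_positions = [i for i, t in enumerate(token_ids) if t == gist_id]

    # No gist tokens: fall back to an ordinary causal mask.
    if not gist_positions:
        return [[j <= i for j in range(n)] for i in range(n)]

    first_gist, last_gist = gist_positions[0], gist_positions[-1]
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i  # standard left-to-right constraint
            # Tokens after the gist span may not look at the original prompt.
            blocked = i > last_gist and j < first_gist
            row.append(causal and not blocked)
        mask.append(row)
    return mask


# Hypothetical sequence: two prompt tokens, one gist token (99), two input tokens.
tokens = [5, 6, 99, 7, 8]
mask = make_gist_mask(tokens, gist_id=99)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In this sketch the gist token itself still attends to the full prompt, while later tokens see only the gist token and each other. The paper's Appendix A and the linked repository remain the reference for the actual masking code.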