Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning to Compress Prompts with Gist Tokens

Authors: Jesse Mu, Xiang Li, Noah Goodman

NeurIPS 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental On decoder (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) LMs, gisting enables up to 26x compression of prompts, resulting in up to 40% FLOPs reductions, 4.2% wall time speedups, and storage savings, all with minimal loss in output quality.
Researcher Affiliation Academia Jesse Mu, Xiang Lisa Li, Noah Goodman Stanford University EMAIL, EMAIL
Pseudocode Yes A Example PyTorch Implementation of Gist Masking
Open Source Code Yes Code, data, and model checkpoints are available at https://github.com/jayelm/gisting.
Open Datasets Yes To obtain the largest possible set of tasks for instruction finetuning, we create a dataset called Alpaca+, which combines the Self-Instruct [36] and Stanford Alpaca [31] instruction tuning datasets, each consisting of (t, x, y) tuples sampled from OpenAI's text-davinci-001 and text-davinci-003 variants of GPT-3, respectively.
Dataset Splits Yes From Alpaca+ we hold out 3 validation splits: 1000 Seen prompts (with unseen, non-empty inputs); 1000 Unseen prompts (with non-empty inputs); and the 252 hand-written Human prompts and completions used in Wang et al. [36], of which 83% have non-empty inputs.
Hardware Specification Yes Experiments were run on a cluster machine with 4x A100-SXM4-80GB NVIDIA GPUs, 480GB RAM, and 16 CPUs, using PyTorch 2.0 [24], Hugging Face Transformers [41], and DeepSpeed [29].
Software Dependencies Yes Experiments were run on a cluster machine with 4x A100-SXM4-80GB NVIDIA GPUs, 480GB RAM, and 16 CPUs, using PyTorch 2.0 [24], Hugging Face Transformers [41], and DeepSpeed [29].
Experiment Setup Yes Full hyperparameters for training runs are located in Table A.1.
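
For readers unfamiliar with the gist masking named in the Pseudocode row, the following is a minimal illustrative sketch, not the authors' implementation (which is in the paper's appendix and the linked repository). It assumes a decoder-only layout of prompt tokens, then a contiguous run of gist tokens, then input/output tokens, and it assumes the mask is a causal mask in which positions after the gist tokens cannot attend to positions before them. The function name `gist_causal_mask` and the `is_gist` argument are hypothetical, chosen only for this example.

```python
def gist_causal_mask(is_gist):
    """Return mask[q][k] = True where query position q may attend to key k.

    is_gist: list of bools marking gist-token positions
             (assumed contiguous for this sketch).
    """
    n = len(is_gist)
    gist_positions = [i for i, g in enumerate(is_gist) if g]

    # Start from an ordinary causal (lower-triangular) mask.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    if not gist_positions:
        return mask  # no gist tokens: plain causal attention

    first_gist, last_gist = gist_positions[0], gist_positions[-1]

    # Positions after the gist tokens lose access to the original prompt,
    # so any prompt information must flow through the gist tokens.
    for q in range(last_gist + 1, n):
        for k in range(first_gist):
            mask[q][k] = False
    return mask


# Hypothetical layout: 3 prompt tokens, 1 gist token, 2 input/output tokens.
mask = gist_causal_mask([False, False, False, True, False, False])
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In this sketch the gist token (row 4 of the printed grid) still sees the full prompt, while the two rows after it see only the gist token and later positions. Under that assumption, the prompt can be dropped at inference time once the gist token's activations are cached.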