You Only Cache Once: Decoder-Decoder Architectures for Language Models

Authors: Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results demonstrate that YOCO achieves favorable performance compared to Transformer in various settings of scaling up model size and number of training tokens."
Researcher Affiliation | Collaboration | Tsinghua University; Microsoft Research
Pseudocode | Yes | Appendix C, "Pseudo Code of Gated Retention" (a hedged sketch of the recurrence is given after this table)
Open Source Code | No | "Code will be released in camera-ready version."
Open Datasets | No | "The curated training corpus is similar to [39]."
Dataset Splits | Yes | Results: "Figure 3 reports the validation loss with various parameter counts."
Hardware Specification | Yes | "The experiments are conducted with H100-80GB GPU cards."
Software Dependencies | No | The paper states "We implement a Triton [36] kernel for gated retention." but does not provide version numbers for Triton or other software dependencies.
Experiment Setup | Yes | "Detailed hyperparameters are described in Appendix D."
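
To complement the Pseudocode row, here is a minimal, single-head sketch of gated retention in its recurrent form. It assumes the recurrence S_t = g_t * S_{t-1} + k_t^T v_t with output o_t = q_t S_t and a data-dependent, temperature-smoothed decay gate g_t = sigmoid(x_t W_g)^(1/tau); it is written from the paper's description rather than any released code, and all class and parameter names (GatedRetentionSketch, g_proj, tau) are illustrative.

```python
# A minimal sketch of gated retention in its recurrent, single-head form.
# Assumption: S_t = g_t * S_{t-1} + k_t^T v_t and o_t = q_t S_t, with a
# data-dependent decay gate g_t = sigmoid(x_t W_g)^(1/tau). Names are illustrative.
import torch
import torch.nn as nn


class GatedRetentionSketch(nn.Module):
    def __init__(self, d_model: int, tau: float = 16.0):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.g_proj = nn.Linear(d_model, 1, bias=False)   # scalar decay gate per step
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Data-dependent decay in (0, 1), smoothed by the temperature tau.
        gate = torch.sigmoid(self.g_proj(x)) ** (1.0 / self.tau)   # (b, n, 1)

        state = x.new_zeros(b, d, d)   # recurrent state S_t, fixed size per sequence
        outputs = []
        for t in range(n):
            kv = k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(-2)        # outer product k_t^T v_t
            state = gate[:, t].unsqueeze(-1) * state + kv             # S_t = g_t S_{t-1} + k_t^T v_t
            outputs.append(torch.einsum("bd,bde->be", q[:, t], state))  # o_t = q_t S_t
        return self.out_proj(torch.stack(outputs, dim=1))
```

The recurrent state is a fixed d x d matrix per sequence, so the self-decoder's inference-time memory stays constant in sequence length rather than growing like a standard KV cache.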
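
The "cache once" in the title refers to the decoder-decoder layout: a self-decoder built from efficient modules such as gated retention processes the sequence first, a single set of global keys and values is computed from its output, and every cross-decoder layer reuses that one cache. The sketch below illustrates this arrangement under those assumptions, reusing the GatedRetentionSketch above as a stand-in for the self-decoder; the layer split, module choices, and names are illustrative rather than the paper's implementation.

```python
# A rough sketch of the decoder-decoder "cache once" layout, reusing
# GatedRetentionSketch above as the self-decoder module. The layer split,
# module choices, and names are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class YOCOSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_layers: int):
        super().__init__()
        half = n_layers // 2
        # Self-decoder: efficient self-attention with a small, constant-size cache.
        self.self_decoder = nn.ModuleList(
            [GatedRetentionSketch(d_model) for _ in range(half)]
        )
        # One pair of key/value projections: the global cache is produced only once.
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        # Cross-decoder: standard attention layers that all read the shared cache.
        self.cross_decoder = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(half)]
        )

    def forward(self, x: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        h = x
        for layer in self.self_decoder:
            h = h + layer(h)                        # efficient self-attention
        k, v = self.k_proj(h), self.v_proj(h)       # global KV, cached once
        for attn in self.cross_decoder:
            out, _ = attn(h, k, v, attn_mask=causal_mask, need_weights=False)
            h = h + out                             # cross-attention over the shared cache
        return h


# Hypothetical usage: a causal mask so position t only reads cached positions <= t.
seq_len = 8
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
model = YOCOSketch(d_model=64, n_heads=4, n_layers=4)
y = model(torch.randn(2, seq_len, 64), mask)
```

Because the keys and values are produced once and shared across the cross-decoder layers, the inference-time cache no longer scales with depth, which is the memory saving the title alludes to.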