You Only Cache Once: Decoder-Decoder Architectures for Language Models
Authors: Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that YOCO achieves favorable performance compared to Transformer in various settings of scaling up model size and number of training tokens. |
| Researcher Affiliation | Collaboration | Tsinghua University; Microsoft Research |
| Pseudocode | Yes | Appendix C, "Pseudo Code of Gated Retention" (a hedged sketch follows the table). |
| Open Source Code | No | Code will be released in camera-ready version. |
| Open Datasets | No | The curated training corpus is similar to [39]. |
| Dataset Splits | Yes | Figure 3 reports the validation loss with various parameter counts. |
| Hardware Specification | Yes | The experiments are conducted with H100-80GB GPU cards. |
| Software Dependencies | No | The paper states 'We implement a Triton [36] kernel for gated retention.', but does not provide version numbers for Triton or other software dependencies. |
| Experiment Setup | Yes | Detailed hyperparameters are described in Appendix D. |
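
Appendix C of the paper gives the authors' pseudocode for gated retention; since that appendix is not reproduced here, the snippet below is only a minimal, single-head, recurrent-form sketch of what such a mechanism typically looks like, not the paper's implementation. It assumes a retention-style recurrence S_t = gamma_t * S_{t-1} + k_t^T v_t with output o_t = q_t S_t and a data-dependent decay gamma_t = sigmoid(x_t w_g)^(1/tau); the projection names (w_q, w_k, w_v, w_g), the shapes, and the temperature tau are illustrative assumptions.

```python
import numpy as np

def gated_retention_recurrent(x, w_q, w_k, w_v, w_g, tau=8.0):
    """Minimal single-head, recurrent-form sketch of gated retention.

    Assumes the retention-style recurrence
        S_t = gamma_t * S_{t-1} + k_t^T v_t,    o_t = q_t S_t,
    with a data-dependent scalar decay gamma_t = sigmoid(x_t w_g)^(1/tau).
    Shapes and projection names are illustrative, not the paper's.
    """
    seq_len, d_model = x.shape
    d_head = w_v.shape[1]
    state = np.zeros((w_k.shape[1], d_head))   # running KV state S_t
    outputs = np.zeros((seq_len, d_head))
    for t in range(seq_len):
        q_t = x[t] @ w_q                                  # query, shape (d_k,)
        k_t = x[t] @ w_k                                  # key, shape (d_k,)
        v_t = x[t] @ w_v                                  # value, shape (d_head,)
        gate = 1.0 / (1.0 + np.exp(-(x[t] @ w_g)))        # sigmoid gate, scalar
        gamma_t = gate ** (1.0 / tau)                     # data-dependent decay in (0, 1)
        state = gamma_t * state + np.outer(k_t, v_t)      # S_t update
        outputs[t] = q_t @ state                          # o_t = q_t S_t
    return outputs

# Toy usage with random weights (illustrative only).
rng = np.random.default_rng(0)
d_model, d_k, d_head, seq_len = 16, 16, 16, 8
x = rng.standard_normal((seq_len, d_model))
o = gated_retention_recurrent(
    x,
    rng.standard_normal((d_model, d_k)) / np.sqrt(d_model),
    rng.standard_normal((d_model, d_k)) / np.sqrt(d_model),
    rng.standard_normal((d_model, d_head)) / np.sqrt(d_model),
    rng.standard_normal(d_model) / np.sqrt(d_model),
)
print(o.shape)  # (8, 16)
```

The constant-size state is what keeps the self-decoder's memory footprint independent of sequence length; in YOCO the cross-decoder layers then reuse a single global KV cache produced by the self-decoder, which is the "cache once" idea in the title.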