Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

You Only Cache Once: Decoder-Decoder Architectures for Language Models

Authors: Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that YOCO achieves favorable performance compared to Transformer in various settings of scaling up model size and number of training tokens.
Researcher Affiliation | Collaboration | Tsinghua University; Microsoft Research
Pseudocode | Yes | C Pseudo Code of Gated Retention
Open Source Code | No | Code will be released in camera-ready version.
Open Datasets | No | The curated training corpus is similar to [39].
Dataset Splits | Yes | Results: Figure 3 reports the validation loss with various parameter counts.
Hardware Specification | Yes | The experiments are conducted with H100-80GB GPU cards.
Software Dependencies | No | The paper states 'We implement a Triton [36] kernel for gated retention.', but does not provide version numbers for Triton or other software dependencies.
Experiment Setup | Yes | Detailed hyperparameters are described in Appendix D.