Time-Reversal Provides Unsupervised Feedback to LLMs

Authors: Yerram Varun, Rahul Madhavan, Sravanti Addepalli, Arun Suggala, Karthikeyan Shanmugam, Prateek Jain

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We show empirically (and theoretically in a stylized setting) that time-reversed models can indeed complement forward model predictions when used to score the query given response for re-ranking multiple forward generations. We obtain up to 5% improvement on the widely used AlpacaEval Leaderboard over the competent baseline of best-of-N re-ranking using self log-perplexity scores. We further show that TRLM scoring outperforms conventional forward scoring of response given query, resulting in significant gains in applications such as citation generation and passage retrieval." (A minimal sketch of this reverse-scoring re-rank appears after the table.) |
| Researcher Affiliation | Collaboration | Varun Yerram (Google DeepMind), Rahul Madhavan (Indian Institute of Science), Sravanti Addepalli (Google DeepMind), Arun Suggala (Google DeepMind), Karthikeyan Shanmugam (Google DeepMind), Prateek Jain (Google DeepMind) |
| Pseudocode | Yes | Algorithm 1 TRLM-Ba.Pretrain, Algorithm 2 TRLM-Ba.Score, Algorithm 3 TRLM-Fo.Score, Algorithm 4 TRLM-FoBa.Pretrain, Algorithm 5 TRLM-Ba.Generate, Algorithm 6 TRLM-Fo.Generate. (A token-reversal sketch of the TRLM-Ba idea appears after the table.) |
| Open Source Code | No | "We do not release any model or datasets." |
| Open Datasets | Yes | "The pre-training setup for all TRLM models is identical to that of PaLM2-Otter models described by Anil et al. [2023b], except for the token orders specified by our TRLM.pretrain methods for TRLM-Fo, TRLM-Ba and TRLM-FoBa respectively. We fine-tune them on the FLAN dataset [Longpre et al., 2023] using the TRLM-xx.pretrain function." Appendix H (Licenses and Copyrights Across Assets) further lists, among others: CNN Daily Mail [Zhong et al., 2020], Apache 2.0 license; MS MARCO [Bajaj et al., 2016], Microsoft Terms and Conditions; NF-Corpus [Boteva et al., 2016b], Terms of Use. |
| Dataset Splits | No | The paper uses established benchmarks and datasets but does not explicitly state training/validation/test splits (as percentages or counts) for its own experiments. It refers to existing datasets and their evaluation splits (e.g., the test split) but does not define its own training/validation splits. |
| Hardware Specification | Yes | "To pre-train TRLM models we use two TPUv5e pods [Cloud] for two weeks in the setup described by Anil et al. [2023b]. Further details on pre-training are provided in Appendix B. We run fine-tuning on the FLAN dataset using a TPUv5e pod [Cloud] for 1 day." |
| Software Dependencies | No | The paper mentions specific datasets (e.g., the FLAN dataset) and models (e.g., PaLM2-Otter, Gemini-Pro-1.0, Mixtral), but does not specify software dependencies such as programming-language or library versions (e.g., PyTorch or TensorFlow versions). |
| Experiment Setup | Yes | "We generate 16 responses using a temperature τ = 0.8 to ensure diversity of answers. We then rerank the responses using different variants of TRLM from the PaLM2-Otter family of models (TRLM training details in the supplement). We further consider two baselines, Self scoring and Forward Baselines, as described in Table 1. Scoring prompts and Conditioning prompts used with various TRLM variants for this task are described in Table 7 of Appendix C.1." (An end-to-end best-of-N sketch appears below.) |
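
To make the scoring direction in the Research Type row concrete, here is a minimal sketch of re-ranking forward generations by scoring the query given each response. `reverse_log_prob` is a hypothetical stand-in for a time-reversed LM's sequence log-likelihood, not the paper's implementation.

```python
# Minimal sketch of TRLM-style re-ranking: pick, among N forward generations,
# the response under which the query itself is most likely.
# `reverse_log_prob(query, response)` is a hypothetical stand-in for a
# time-reversed LM's log P(query | response); it is NOT the paper's code.

from typing import Callable, List

def rerank_by_reverse_score(
    query: str,
    responses: List[str],
    reverse_log_prob: Callable[[str, str], float],
) -> str:
    # Forward scoring would rank by log P(response | query); TRLM scoring
    # ranks by log P(query | response) instead.
    return max(responses, key=lambda r: reverse_log_prob(query, r))
```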
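The Pseudocode row names token-reversal algorithms; the sketch below illustrates the structure of TRLM-Ba.Pretrain and TRLM-Ba.Score under assumed `tokenize` and `model_log_prob` interfaces: pretrain a standard next-token LM on token-reversed sequences, then score the query given the response over the reversed text. This is one plausible reading of Algorithms 1 and 2, not the paper's code.

```python
# Illustration of the token-reversal idea behind TRLM-Ba.Pretrain and
# TRLM-Ba.Score (Algorithms 1-2). `tokenize` and `model_log_prob` are assumed
# interfaces: a tokenizer, and a next-token LM's log-likelihood of a target
# continuation given a conditioning context.

from typing import Callable, List

def reverse_tokens(ids: List[int]) -> List[int]:
    # TRLM-Ba pretraining feeds an ordinary next-token training loop with
    # sequences whose token order has been reversed.
    return ids[::-1]

def trlm_ba_score(
    query: str,
    response: str,
    tokenize: Callable[[str], List[int]],
    model_log_prob: Callable[[List[int], List[int]], float],
) -> float:
    # Score log P_reverse(query | response): condition on the reversed
    # response tokens and score the reversed query tokens, matching the
    # reversed order the model was pretrained on.
    context = reverse_tokens(tokenize(response))
    target = reverse_tokens(tokenize(query))
    return model_log_prob(context, target)
```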
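Finally, the Experiment Setup row translates into the best-of-N sketch below: sample 16 candidates at temperature 0.8, then re-rank. `sample_response` and the scoring callables are assumed interfaces, and the scoring/conditioning prompts from Table 7 are omitted.

```python
# End-to-end sketch of the reported setup: 16 samples at temperature 0.8,
# then best-of-N re-ranking. `sample_response` and `score` are assumed
# interfaces; prompts (Table 7 of Appendix C.1) are omitted for brevity.

from typing import Callable, List

N_SAMPLES = 16
TEMPERATURE = 0.8

def best_of_n(
    query: str,
    sample_response: Callable[[str, float], str],
    score: Callable[[str, str], float],
) -> str:
    candidates = [sample_response(query, TEMPERATURE) for _ in range(N_SAMPLES)]
    return max(candidates, key=lambda r: score(query, r))

# TRLM re-ranking uses score(query, r) = log P_reverse(query | r).
# The self-scoring baseline instead ranks by the forward model's own
# length-normalized log-likelihood of the response, i.e. its negative
# self log-perplexity.
```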