Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Better autoregressive regression with LLMs via regression-aware fine-tuning
Authors: Michal Lukasik, Zhao Meng, Harikrishna Narasimhan, Yin-Wen Chang, Aditya Krishna Menon, Felix Yu, Sanjiv Kumar
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate how RAFT improves over established baselines on several benchmarks and model families. ... We systematically compare RAFT against autoregressive and predictive head baselines, and consider several ablations for understanding the crucial design decisions for making a decoder-based LLM work under different settings. See Table 1 for an overview of both the previous works and the approach introduced in this work. Overall, our contributions are as follows: ... (iii) We systematically compare autoregressive regression, predictive head and RAFT approaches across multiple datasets and LLMs, and consistently find RAFT to be the most performant. ... 5 EXPERIMENTAL RESULTS |
| Researcher Affiliation | Industry | Michal Lukasik, Zhao Meng, Harikrishna Narasimhan, Yin-Wen Chang, Aditya Krishna Menon, Felix X. Yu, Sanjiv Kumar EMAIL Google Research |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. Theoretical concepts are presented using mathematical notation and text descriptions. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | (i) US Amazon reviews, where we aim to predict the 5-star rating for a product review (Ni et al., 2019). ... (ii) Semantic Textual Similarity Benchmark (STSB) (Cer et al., 2017), comprising of sentence pairs human-annotated with a similarity score from 0 to 5. ... (iii) MovieLens-1M, where we construct a movie rating prediction task following Luo et al. (2024). ... (Vacareanu et al., 2024a) (the Original #1 dataset). |
| Dataset Splits | Yes | We use 1,500 examples for the test set (after Lukasik et al. (2024)), 1,500 for validation and 10,000 examples for training. ... (STSB 1K; see Table 8 in Appendix). ... We summarize the dataset statistics and the prompts in Table 6 and Table 7 (Appendix). Table 7: Summary of dataset statistics. Wireless: Train size 10,000, Validation size 1,500, Test size 1,500. Personal care: Train size 10,000, Validation size 1,500, Test size 1,500. Music: Train size 10,000, Validation size 1,500, Test size 1,500. STSB: Train size 4,887, Validation size 863, Test size 1,500. STSB 1k: Train size 1,000, Validation size 863, Test size 1,500. MovieLens-1M: Train size 797,758, Validation size 10,145, Test size 10,145. Synthetic (Original #1 from (Vacareanu et al., 2024a)): Train size 10,000, Validation size 1,000, Test size 1,000. |
| Hardware Specification | No | The paper mentions experimenting with Gemma-2 and PaLM-2 models, but it does not specify the particular GPU or CPU models, or any other hardware specifications used for running the experiments. |
| Software Dependencies | No | We use the Adafactor optimizer to save memory during the fine-tuning (we find Adam to not perform better). The parameters for Adafactor are: ϵ1 = 10^-30, ϵ2 = 10^-3, decay rate = 0.8. ... For MovieLens, we use AdamW optimizer and sweep learning rates from the range {10^-4, 10^-5, 10^-6}. |
| Experiment Setup | Yes | We use dropout rate 0.1 and batch size 16. We train for 200K steps and select the best step using the held out validation set... We use a constant learning rate schedule. We select the learning rate value over the validation set from the values: 10^-4, 10^-5, 10^-6. ... For PaLM-2, we use the above settings, except we use batch size 64 and no dropout, and train for 5K steps and report the results from the last checkpoint. For MovieLens, we use AdamW optimizer and sweep learning rates from the range {10^-4, 10^-5, 10^-6}. We use a cosine decay schedule for the learning rate, with 10K steps of warmup from learning rate 10^-8. |
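For quick reference, the split sizes quoted in the Dataset Splits row (from the paper's Table 7) can be collected into a small mapping. The dictionary name and structure here are our own illustration; the numbers are as reported:

```python
# Dataset split sizes as reported in Table 7 of the paper.
# Keys and layout are illustrative; values are quoted verbatim.
DATASET_SPLITS = {
    "Wireless":      {"train": 10_000,  "validation": 1_500,  "test": 1_500},
    "Personal care": {"train": 10_000,  "validation": 1_500,  "test": 1_500},
    "Music":         {"train": 10_000,  "validation": 1_500,  "test": 1_500},
    "STSB":          {"train": 4_887,   "validation": 863,    "test": 1_500},
    "STSB 1k":       {"train": 1_000,   "validation": 863,    "test": 1_500},
    "MovieLens-1M":  {"train": 797_758, "validation": 10_145, "test": 10_145},
    "Synthetic":     {"train": 10_000,  "validation": 1_000,  "test": 1_000},
}
```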
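The MovieLens learning-rate schedule described in the Experiment Setup row (10K warmup steps starting from 10^-8, followed by cosine decay) can be sketched in plain Python. This assumes a standard linear-warmup-then-cosine shape; the total step count and final learning rate are our assumptions, since the paper specifies only the warmup and the swept peak values:

```python
import math

def warmup_cosine_lr(step, peak_lr, warmup_steps=10_000,
                     total_steps=200_000, init_lr=1e-8, end_lr=0.0):
    """Linear warmup from init_lr to peak_lr over warmup_steps,
    then cosine decay from peak_lr to end_lr by total_steps.

    warmup_steps and init_lr follow the paper; total_steps and
    end_lr are illustrative assumptions.
    """
    if step < warmup_steps:
        frac = step / warmup_steps
        return init_lr + frac * (peak_lr - init_lr)
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end_lr + cosine * (peak_lr - end_lr)
```

One of the swept peak values, e.g. `peak_lr=1e-4`, would be selected on the validation set.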