ReFT: Representation Finetuning for Language Models

Authors: Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D Manning, Christopher Potts

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We showcase LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks, instruction-tuning, and GLUE. In all these evaluations, our ReFTs deliver the best balance of efficiency and performance, and almost always outperform state-of-the-art PEFTs.
Researcher Affiliation | Academia | Stanford University; Pr(Ai)2R Group; {wuzhengx,aryamana,peterwz,atticusg}@stanford.edu; {jurafsky,manning,cgpotts}@stanford.edu
Pseudocode | No | The paper includes mathematical equations, but no structured blocks explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | We release a generic ReFT training library publicly at https://github.com/stanfordnlp/pyreft.
Open Datasets | Yes | Datasets. Our benchmark contains eight commonsense reasoning datasets, including BoolQ [Clark et al., 2019], PIQA [Bisk et al., 2020], SIQA [Sap et al., 2019], HellaSwag [Zellers et al., 2019], WinoGrande [Sakaguchi et al., 2021], ARC-e, ARC-c [Clark et al., 2018], and OBQA [Mihaylov et al., 2018].
Dataset Splits | Yes | Unlike previous work [Hu et al., 2022, 2023, Liu et al., 2024c] where hyperparameter tuning may involve optimising performance directly on test sets, we only tune our hyperparameters on development sets which do not contain any overlapping examples with the test sets of our tasks.
Hardware Specification | Yes | All of our experiments are run with a single GPU: NVIDIA A100 40G/80G or RTX 6000.
Software Dependencies | No | Our library is built on top of pyvene [Wu et al., 2024b], a library for performing and training activation interventions on arbitrary PyTorch models. We load our base LMs in torch.bfloat16 to save memory.
Experiment Setup | Yes | For our experiments, we must decide how many interventions to learn and which layers and input positions to apply each one on. We propose learning interventions on a fixed number of p prefix and s suffix positions in the prompt. Specifically, we tune four hyperparameters: (1) the number of prefix positions p to intervene on, i.e. positions {1,...,p}; (2) the number of suffix positions s to intervene on, i.e. positions {n-s+1,...,n}; (3) which set of layers L to intervene on; (4) whether or not to tie intervention parameters ϕ across different positions in the same layer.
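The software-dependency row notes that base LMs are loaded in torch.bfloat16 to save memory. A minimal sketch of why this halves parameter memory relative to float32, using plain PyTorch (a toy tensor, not the pyreft loading code):

```python
import torch

# Toy weight tensor standing in for an LM parameter (name is illustrative).
w_fp32 = torch.randn(8, 8, dtype=torch.float32)
w_bf16 = w_fp32.to(torch.bfloat16)

# float32 stores 4 bytes per element; bfloat16 stores 2, so memory halves.
ratio = w_fp32.element_size() / w_bf16.element_size()
print(ratio)  # 2.0
```

bfloat16 keeps float32's 8-bit exponent range while dropping mantissa precision, which is why it is a common drop-in for inference and finetuning on A100-class GPUs.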
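The position hyperparameters in the experiment-setup row (p prefix positions {1,...,p} and s suffix positions {n-s+1,...,n}) can be sketched as a small helper. The function name and overlap handling are my own illustration, not the pyreft API:

```python
def intervention_positions(n, p, s):
    """Return 1-based token indices to intervene on for a length-n prompt:
    the first p positions {1, ..., p} and the last s positions
    {n-s+1, ..., n}, following the paper's notation."""
    prefix = list(range(1, min(p, n) + 1))
    suffix = list(range(max(n - s + 1, 1), n + 1))
    # Deduplicate in case prefix and suffix overlap on short prompts.
    return sorted(set(prefix + suffix))

print(intervention_positions(10, p=3, s=2))  # [1, 2, 3, 9, 10]
```

The remaining two hyperparameters (the layer set L and whether intervention parameters ϕ are tied across positions in a layer) determine how many distinct intervention modules are trained at these positions.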