Representation Surgery: Theory and Practice of Affine Steering

Authors: Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation."
Researcher Affiliation | Collaboration | ¹IIIT Hyderabad, ²Bar-Ilan University (work done during an internship at Google Research), ³Google Research, ⁴ETH Zurich.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/shauliravfogel/affine-steering
Open Datasets | Yes | "we experiment on the Bios dataset (De-Arteaga et al., 2019)... we consider Blodgett et al.'s (2016) dataset on various dialects of American English... Our two affine steering functions are fitted on balanced classification data that consists of full sentences with human toxicity labels, the Toxic Comments Classification Challenge data (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)... WikiText-2 (Merity et al., 2017)."
Dataset Splits | No | The paper mentions training data and development sets (e.g., the "training section of the Bios dataset" and "development set accuracy") but does not specify explicit percentages or sample counts for the training, validation, and test splits. It refers to "a split of 10k samples from the non-toxic split of Real Toxicity Prompts" for evaluation, which serves as a test set, but lacks a full breakdown of all splits.
Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | Yes | "We use the Python Optimal Transport (Flamary et al., 2021) implementation of the mean and covariance matching transformation... The MLP was trained in Scikit-learn (Pedregosa et al., 2011) version 1.3.2 with the default parameters." (A sketch of this transformation appears after the table.)
Experiment Setup | Yes | "To embed the biography using a single vector, we take the last-layer CLS representation for BERT and take the last-token, last-hidden-layer representations over the text for the other models. We lower the dimensionality of the Llama2 vectors to 768 using PCA. Then, we fit a logistic regression classifier... We use the same decoding sampling parameters as in Liu et al. (2021), Pozzobon et al. (2023), and Gehman et al. (2020); they are listed in Table 5." Table 5 gives: number of samples 25; max length 20; temperature 1; top-p (sampling) 0.9; top-k (sampling) 0 (all). (Both steps are sketched below.)
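
To make the mean and covariance matching transformation concrete: it is the affine (Monge) optimal-transport map between two Gaussians and has a closed form. The paper uses the POT library's implementation; the NumPy/SciPy sketch below is an illustrative stand-in, and the function name is our own.

```python
import numpy as np
from scipy.linalg import sqrtm

def fit_affine_ot_map(X_src, X_tgt):
    """Fit (W, b) so that x -> W @ x + b matches the target mean and covariance.

    Closed-form Monge map between Gaussians:
        W = C_s^{-1/2} (C_s^{1/2} C_t C_s^{1/2})^{1/2} C_s^{-1/2}
        b = mu_t - W @ mu_s
    """
    mu_s, mu_t = X_src.mean(axis=0), X_tgt.mean(axis=0)
    C_s = np.cov(X_src, rowvar=False)
    C_t = np.cov(X_tgt, rowvar=False)
    C_s_half = sqrtm(C_s)
    C_s_half_inv = np.linalg.inv(C_s_half)
    # sqrtm can return tiny imaginary parts for PSD inputs; keep the real part.
    W = np.real(C_s_half_inv @ sqrtm(C_s_half @ C_t @ C_s_half) @ C_s_half_inv)
    b = mu_t - W @ mu_s
    return W, b

# Toy usage: steer source representations toward the target distribution.
rng = np.random.default_rng(0)
X_src = rng.normal(size=(500, 8))
X_tgt = rng.normal(loc=1.0, scale=2.0, size=(500, 8))
W, b = fit_affine_ot_map(X_src, X_tgt)
steered = X_src @ W.T + b  # approximately matches X_tgt's mean and covariance
```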
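
The probing pipeline in the Experiment Setup row can be sketched as follows. The checkpoint ("gpt2"), toy data, and helper name are illustrative assumptions; only the last-token, last-hidden-layer pooling, the PCA reduction to 768 dimensions (applied to Llama2 vectors), and the logistic regression probe come from the paper.

```python
import torch
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; the paper probes BERT (CLS pooling) and
# other models (last-token pooling), including Llama2.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

def embed_last_token(texts):
    """Last-token, last-hidden-layer representation for each text."""
    reps = []
    for text in texts:
        inputs = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
        reps.append(hidden[0, -1])  # vector at the final token
    return torch.stack(reps).numpy()

# Toy stand-in data; the paper fits on the Bios training set.
texts = ["She is a registered nurse.", "He is a software engineer."]
labels = [0, 1]

X = embed_last_token(texts)
# The paper reduces Llama2 vectors to 768 dims with PCA; the cap on
# n_components exists only so this toy example runs on two samples.
X = PCA(n_components=min(768, *X.shape)).fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```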
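
Similarly, the Table 5 decoding parameters translate directly into a Hugging Face transformers generate() call. The base model and prompt below are placeholders; only the sampling values are from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The quick brown fox", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    num_return_sequences=25,        # number of samples: 25
    max_new_tokens=20,              # max length: 20 (assumed to mean new tokens)
    temperature=1.0,                # temperature: 1
    top_p=0.9,                      # top-p (sampling): 0.9
    top_k=0,                        # top-k (sampling): 0, i.e. disabled
    pad_token_id=tok.eos_token_id,  # silence the missing-pad-token warning
)
# Strip the prompt tokens to keep only the sampled continuations.
continuations = tok.batch_decode(outputs[:, inputs["input_ids"].shape[1]:])
```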