Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss
Authors: Pau Rodriguez, Michal Klein, Eleonora Gualdoni, Valentino Maiorca, Arno Blaas, Luca Zappella, Marco Cuturi, Xavier Suau
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experimental Results. We analyze the effectiveness of LinEAS at the important task of toxicity mitigation. To that end, we compare with prompting, CAA [7], ReFT [12], ITI-C [9] and Lin-ACT [10] on three LLMs ranging from 1.5B to 7B parameters, by aligning the activations of N = 32 toxic to 32 non-toxic sentences sampled from the Jigsaw dataset [23]. |
| Researcher Affiliation | Collaboration | Pau Rodríguez (Apple), Michal Klein (Apple), Eleonora Gualdoni (Apple), Valentino Maiorca (Sapienza, Apple), Arno Blaas (Apple), Luca Zappella (Apple), Marco Cuturi (Apple), Xavier Suau (Apple) |
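| Pseudocode | Yes | Algorithm 1: Proximal E2E Training Step. 1: Require: prompts $(x_i)_i \sim p$, $(y_j)_j \sim q$, LR $\rho$. 2: (pre-)compute activations $\eta_i^\ell$, $i \le n$, $\ell \le L$ (Eq. (2)). 3: compute activations lists $\xi_i^\ell$, $i \le n$, $\ell \le L$ (Eq. (1)). 4: set loss to $C = 0$. 5: for $\ell \le L$ do (forward) 6: $Z := [\xi_1^\ell, \dots, \xi_n^\ell] \in \mathbb{R}^{n \times d_\ell}$ (Eq. (1)) 7: $V := [\eta_1^\ell, \dots, \eta_n^\ell] \in \mathbb{R}^{n \times d_\ell}$ (Eq. (2)) 8: $C \leftarrow C + {}$ layer-$\ell$ loss on $(Z, V)$ (Eq. (4)) 9: end for. 10: for $\ell \le L$ do 11: $g_\omega, g_b \leftarrow \nabla_{\omega_\ell, b_\ell} C$ (backpropagation) 12: $\omega_\ell, b_\ell \leftarrow \omega_\ell - \rho\, g_\omega,\ b_\ell - \rho\, g_b$ (updates) 13: $\omega_\ell \leftarrow \operatorname{Prox}^{\lVert\cdot\rVert_2}_{\gamma\lambda_G}\big(\operatorname{ST}_{\gamma\lambda_1}(\omega_\ell - \mathbf{1})\big) + \mathbf{1}$ (Eq. (7)) 14: $b_\ell \leftarrow \operatorname{Prox}^{\lVert\cdot\rVert_2}_{\gamma\lambda_G}\big(\operatorname{ST}_{\gamma\lambda_1}(b_\ell)\big)$ (Eq. (7)) 15: end for |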
| Open Source Code | No | Footnote 1: https://github.com/apple/ml-lineas. Justification: The data we use is all publicly available and we intend to release the code as well. Unfortunately the code release will happen sometimes after the submission. |
| Open Datasets | Yes | We analyze the effectiveness of LinEAS at the important task of toxicity mitigation. To that end, we compare with prompting, CAA [7], ReFT [12], ITI-C [9] and Lin-ACT [10] on three LLMs ranging from 1.5B to 7B parameters, by aligning the activations of N = 32 toxic to 32 non-toxic sentences sampled from the Jigsaw dataset [23]. We evaluate toxicity mitigation on the RealToxicityPrompts (RTP) dataset [24] and the Thoroughly Engineered Toxicity (TET) dataset [25]. Utility Metrics. To measure whether the utility of the model is affected by these interventions, we report $\mathrm{PPL}_{\mathrm{WIK}}$, the perplexity obtained on a fixed set of 20k Wikipedia sentences [27], as well as the overall 5-shot accuracy on the MMLU compendium [28]. We draw a set of 50 concepts, from the MEN dataset [29], a resource of 3,000 word pairs annotated with human similarity judgments. |
| Dataset Splits | Yes | We analyze the effectiveness of LinEAS at the important task of toxicity mitigation. To that end, we compare with prompting, CAA [7], ReFT [12], ITI-C [9] and Lin-ACT [10] on three LLMs ranging from 1.5B to 7B parameters, by aligning the activations of N = 32 toxic to 32 non-toxic sentences sampled from the Jigsaw dataset [23]. For RTP, we follow Rodriguez et al. [10] by sampling 1000 prompts from the dataset and let the model (intervened or not) complete them. For TET, we use the 2546 prompts provided. In this section, we complement and corroborate our insights from the experiments on toxicity mitigation in Section 4.1 with additional experiments on inducing truthfulness in LLMs, using the TruthfulQA benchmark [44]. In particular, we investigate how well LinEAS achieves to induce truthfulness on this benchmark in comparison to Lin-ACT, its strongest activation steering competitor from Section 4.1. For LinEAS, we apply the intervention again to the post layernorm layers, while for Lin-ACT, we apply them to all layernorm layers as this was reported as optimal for Lin-ACT for TruthfulQA experiments in Rodriguez et al. [10]. We use 2-fold cross-validation on the 817 questions of the multiple choice part of the benchmark and learn the intervention on the concatenation of training fold questions concatenated with either incorrect (source) or correct (target) multiple-choice answer options. |
| Hardware Specification | Yes | The experiments in this work were computed on a single NVIDIA A100 GPU with 80GB RAM and they could also fit in an NVIDIA A100 with 40GB RAM. |
| Software Dependencies | No | We perform parameter-efficient adaptation of our baseline models with Huggingface's implementation of LoRA in their PEFT library and Huggingface's implementation of the PPO reinforcement learning algorithm in their TRL library. This is because LinEAS leverages PyTorch's backpropagation, which is optimized compared to ITI and Lin-Act's layer-wise estimation methods. |
| Experiment Setup | Yes | Setup. All methods have access to only 32 toxic and 32 non-toxic sentences (unpaired). We optimize LinEAS for 1K steps using SGD and a learning rate of 0.1. As we focus on the benefit of using an end-to-end loss, we use γ = 0 for LinEAS (see Section 4.3 for γ's impact). For Lin-ACT and CAA, we use their default settings, and set intervention strength to λ = 1. For ReFT, we train for 10 epochs (selected with an epoch sweep). For ITI-C we use λ = 0.5, obtained through grid search. Setup. As in [10], we modulate the strength of LinEAS applied to all layernorm layers by introducing a scale 0 ≤ λ ≤ 1 when applying interventions. Intuitively λ = 0 results in no intervention, while λ = 1 carries out a full LinEAS transport, any value in between reflecting a gradual change. Following Section 4.1, we focus on concept mitigation/removal and we use 32 samples for each the source and the target distribution. We train LinEAS for 1000 steps with batch size 4, AdamW, learning rate of 1e-4 and γ = 0. Find additional implementation details in Appendix N. Table 11: List of hyper-parameters used to train PPO for 32 and 1024 samples. |
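
Illustrative sketch (not the authors' code): the per-layer update in steps 12 to 14 of Algorithm 1, as quoted in the Pseudocode row, might look roughly like the NumPy snippet below. It assumes that ST is elementwise soft-thresholding, that Prox is the proximal operator of a scaled Euclidean norm, and that each layer's scale vector or bias vector forms a single group. The function names `soft_threshold`, `group_prox`, and `proximal_update` are hypothetical and do not come from the paper or its repository.

```python
import numpy as np


def soft_threshold(v, tau):
    """Elementwise soft-thresholding, i.e. the prox of tau * ||v||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def group_prox(v, tau):
    """Prox of tau * ||v||_2, treating the whole vector v as one group (assumption)."""
    norm = np.linalg.norm(v)
    if norm <= tau:
        # The whole group is shrunk to zero.
        return np.zeros_like(v)
    return (1.0 - tau / norm) * v


def proximal_update(omega, b, g_omega, g_b, lr, gamma, lam_1, lam_g):
    """One gradient step followed by the two proximal operators.

    The scale omega is regularized toward 1 and the bias b toward 0,
    so a fully shrunk layer reduces to the identity map.
    """
    # Step 12: plain gradient step on scale and bias.
    omega = omega - lr * g_omega
    b = b - lr * g_b
    # Step 13: shrink (omega - 1), then shift back by 1.
    omega = group_prox(soft_threshold(omega - 1.0, gamma * lam_1), gamma * lam_g) + 1.0
    # Step 14: shrink the bias toward 0.
    b = group_prox(soft_threshold(b, gamma * lam_1), gamma * lam_g)
    return omega, b
```

With `gamma = 0` both operators leave their input unchanged, so the update reduces to a plain gradient step, which is consistent with the γ = 0 setting quoted in the Experiment Setup row.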