Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

On Optimal Steering to Achieve Exact Fairness

Authors: Mohit Sharma, Amit Deshpande, Chiranjib Bhattacharyya, Rajiv Ratn Shah

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our optimal steering techniques on both synthetic and real-world datasets improve fairness without diminishing utility (and sometimes even improve utility). We demonstrate affine steering of LLM representations to reduce bias in multi-class classification, e.g., occupation prediction from a short biography in Bios dataset (De-Arteaga et al.). Furthermore, we steer internal representations of LLMs towards desired outputs so that it works equally well across different groups. Sections 5 and 6 describe 'Case Study on Gaussian Distributions' and 'Applications: Steering the Representations of an LLM' respectively, detailing experimental results and metrics like TPR-gap, KL divergence, JS divergence, and Bayes Error.
Researcher Affiliation | Collaboration | Mohit Sharma, Department of Computer Science, IIIT Delhi, India; Amit Jayant Deshpande, Microsoft Research India; Chiranjib Bhattacharyya, Department of Computer Science and Automation, Indian Institute of Science, Bengaluru, India; Rajiv Ratn Shah, Department of Computer Science, IIIT Delhi, India
Pseudocode | No | The paper describes methods using mathematical derivations and prose, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code to reproduce our results is provided here. For the multi-class experiments, we use a lot of helper functions from the code of Singh et al. [67]. For the emotion steering experiments, we reproduce the methodology from Zhao et al. [79] and provide the Jupyter notebook in our code.
Open Datasets | Yes | In this experiment, we aim to steer data representations to reduce disparities between groups in a multi-class classification setting. We use the Bias in Bios dataset [27], which comprises web-sourced biographies labelled by profession and annotated for gender.
Dataset Splits | Yes | Our experimental setup closely follows that of Singh et al. [67], who generate representations using the Llama-2 7b model [72] and propose a method called MiMiC (Mean+Covariance Matching), which steers representations via least squares alignment of the first two moments.
Hardware Specification | No | The paper mentions LLM models like Llama-2 7b, Llama-3 8B, and GPT-4o models, but does not provide specific hardware details (GPU/CPU models, memory, etc.) used for running the experiments in the main text or supplementary material.
Software Dependencies | No | The paper mentions using LLM models and Jupyter notebooks for experiments, and that code for reproduction is provided. However, it does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) within the paper text or supplementary material.
Experiment Setup | Yes | At each layer $l$, the final token representation $h_l$ is updated as $h_l = (1 - a)\,h_l + a\,v_l^c$, where $a$ is the strength of the steering. Zhao et al. [79] experiment with values of $a \in (0.03, 0.08)$. We fix $a = 0.03$ for our experiments. We assume an affine relationship between the original and transformed samples per subgroup: $Y = a_{ya} X + b_{ya}$, where $Y \sim \mathcal{N}(\tilde{\mu}_{ya}, \tilde{\sigma}_{ya})$ and $X \sim \mathcal{N}(\mu_{ya}, \sigma_{ya})$. Taking expectation on both sides gives us: $\tilde{\mu}_{ya} = a_{ya}\mu_{ya} + b_{ya}$, $\tilde{\sigma}_{ya}^2 = a_{ya}^2 \sigma_{ya}^2$, and we get the following coefficients: $a_{ya} = \tilde{\sigma}_{ya} / \sigma_{ya}$ and $b_{ya} = \tilde{\mu}_{ya} - a_{ya}\mu_{ya}$.
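
The two formulas quoted in the Experiment Setup entry can be illustrated with a minimal sketch. It is not the authors' released code. The function names `affine_coefficients` and `steer_hidden` are hypothetical, the second Gaussian parameter is assumed to be a standard deviation, and the steering vector is assumed to be a plain list of floats with the same length as the hidden state.

```python
import statistics


def affine_coefficients(mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Return (a, b) such that Y = a * X + b has the target mean and
    standard deviation when X has the source mean and standard deviation.

    Assumption: sigma_* are standard deviations, not variances.
    """
    if sigma_src <= 0 or sigma_tgt <= 0:
        raise ValueError("standard deviations must be positive")
    a = sigma_tgt / sigma_src      # matches the second moment
    b = mu_tgt - a * mu_src        # matches the first moment
    return a, b


def steer_hidden(h, v, strength=0.03):
    """Blend hidden state h with steering vector v:
    (1 - strength) * h + strength * v, applied element-wise.
    """
    if len(h) != len(v):
        raise ValueError("h and v must have the same length")
    return [(1.0 - strength) * hi + strength * vi for hi, vi in zip(h, v)]


# Toy usage: map a small sample onto mean 0 and standard deviation 1.
xs = [0.5, 2.0, 3.5]
a, b = affine_coefficients(statistics.fmean(xs), statistics.pstdev(xs), 0.0, 1.0)
ys = [a * x + b for x in xs]

# Toy usage: nudge a 2-dimensional hidden state towards a steering vector.
steered = steer_hidden([1.0, 0.0], [0.0, 1.0], strength=0.03)
```

Because the map is affine, the transformed sample `ys` matches the target mean and standard deviation exactly for the empirical moments used to fit it; with a small strength such as 0.03, `steered` stays close to the original hidden state.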