PaCE: Parsimonious Concept Engineering for Large Language Models

Authors: Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal

NeurIPS 2024

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities."
Researcher Affiliation | Academia | "University of Pennsylvania; Johns Hopkins University. {jinqiluo,tjding}@upenn.edu"
Pseudocode | Yes | "Algorithm 1: Overcomplete Oblique Projection (ObliqProj)"
Open Source Code | No | "Our collected dataset for concept representations is available at https://github.com/peterljq/Parsimonious-Concept-Engineering. ... We open-source the PaCE-1M dataset to facilitate future research and practical applications of LLM alignment, and will release the source code soon."
Open Datasets | Yes | "Our collected dataset for concept representations is available at https://github.com/peterljq/Parsimonious-Concept-Engineering."
Dataset Splits | No | "We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising... We compare our method in defending maliciousness against activation manipulation methods (Sec. 2.2) on the SafeEdit [74] dataset with its safety scorer... We use the Holistic Bias suite [66] and hate speech evaluator [64] to measure the sentiment of the response..."
Hardware Specification | Yes | "The experiments are conducted on a workstation of 8 NVIDIA A40 GPUs."
Software Dependencies | No | "GPT-4-0125 is used for dictionary construction and concept partition. ... After retrieving the relevant knowledge (with the contriever [27]) from Wikipedia for concept synthesis, we take the top-5 ranked facts to append to the instruction of the LLM. The FAISS-indexed [31] Wikipedia is a snapshot of the 21 million disjoint text blocks from Wikipedia until December 2018."
Experiment Setup | Yes | "Each response of the target LLM is set at a maximum of 512 tokens. Activation vectors are extracted from the last-29th to the last-11th layer (totaling 19 layers) of the target LLM's decoder layers. ... We set the scalar of the representation reading for concept vectors to 3.0. ... When solving the optimization problem for decomposition in Sec. 3.3, we set τ = 0.95 and α = 0.05 following the observations in [82]."
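The rows above quote several technical details from the paper; the sketches below illustrate three of them under clearly stated assumptions. None of them is the authors' released code.

First, the "Pseudocode" row points to Algorithm 1, Overcomplete Oblique Projection (ObliqProj), which decomposes an activation vector over an overcomplete concept dictionary and removes the components attributed to undesirable concepts. The exact objective and the precise role of τ and α follow [82] and are not reproduced in this card, so the sketch below uses a generic elastic-net sparse coding step as a stand-in; the solver choice, the mapping of τ and α onto scikit-learn's l1_ratio and alpha, and the dictionary sizes are assumptions for illustration.

```python
# Illustrative stand-in for an ObliqProj-style step: sparsely decompose an
# activation x over a concept dictionary D, then subtract the contribution of
# concepts flagged as undesirable. The elastic-net objective and its parameters
# are assumptions, not the paper's exact formulation.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
d, n_concepts = 256, 1024                        # illustrative sizes; overcomplete (n_concepts > d)
D = rng.standard_normal((d, n_concepts))
D /= np.linalg.norm(D, axis=0, keepdims=True)    # unit-norm concept directions
x = rng.standard_normal(d)                       # activation vector to be edited

# Sparse coefficients c such that x ≈ D @ c (elastic-net stand-in; parameter
# mapping tau -> l1_ratio, alpha -> alpha is an assumption).
solver = ElasticNet(alpha=0.05, l1_ratio=0.95, fit_intercept=False, max_iter=5000)
solver.fit(D, x)
c = solver.coef_

# Suppose the first 50 dictionary atoms were partitioned as undesirable concepts:
# remove their contribution from the activation, keeping the benign components.
undesirable = np.arange(50)
x_edited = x - D[:, undesirable] @ c[undesirable]
```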
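Second, the "Software Dependencies" row mentions retrieving supporting facts with the contriever [27] from a FAISS-indexed [31] Wikipedia snapshot. A minimal dense-retrieval sketch in that style is shown below; the checkpoint name, mean pooling, flat inner-product index, and two-sentence corpus are illustrative assumptions rather than the paper's pipeline.

```python
# Minimal sketch of dense retrieval over a FAISS index with Contriever-style
# embeddings. Checkpoint, pooling, index type, and corpus are assumptions.
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/contriever")
enc = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    """Mean-pooled token embeddings (a common Contriever pooling choice)."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

corpus = ["Paris is the capital of France.", "The Nile flows through Egypt."]
corpus_emb = embed(corpus)

index = faiss.IndexFlatIP(corpus_emb.shape[1])   # exact inner-product index
index.add(corpus_emb)

scores, ids = index.search(embed(["What is the capital of France?"]), 2)
top_facts = [corpus[i] for i in ids[0]]           # top-ranked facts to append to the instruction
```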
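Third, the "Experiment Setup" row states that activation vectors are extracted from the last-29th to the last-11th decoder layers. The sketch below shows one way to read such a range of hidden states with Hugging Face transformers; the model name, prompt, last-token pooling, and layer-indexing convention are assumptions for illustration.

```python
# Minimal sketch: read hidden states from the last-29th to the last-11th decoder
# layers (19 layers) of a causal LM. Model, prompt, and indexing convention are
# illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"      # assumption; any sufficiently deep causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("How do I write a polite email?", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states = (embedding output, layer 1 output, ..., layer N output).
num_layers = len(out.hidden_states) - 1
layer_ids = range(num_layers - 29, num_layers - 11 + 1)   # 19 layers, as quoted

# Last-token activation per selected layer (whether to use the last token or all
# tokens is an assumption here).
activations = {i: out.hidden_states[i + 1][0, -1, :] for i in layer_ids}
```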