PaCE: Parsimonious Concept Engineering for Large Language Models
Authors: Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities. |
| Researcher Affiliation | Academia | University of Pennsylvania; Johns Hopkins University; {jinqiluo,tjding}@upenn.edu |
| Pseudocode | Yes | Algorithm 1: Overcomplete Oblique Projection (ObliqProj) |
| Open Source Code | No | Our collected dataset for concept representations is available at https://github.com/peterljq/Parsimonious-Concept-Engineering. ... We open-source the PaCE-1M dataset to facilitate future research and practical applications of LLM alignment, and will release the source code soon. |
| Open Datasets | Yes | Our collected dataset for concept representations is available at https://github.com/peterljq/Parsimonious-Concept-Engineering. |
| Dataset Splits | No | We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising... We compare our method in defending maliciousness against activation manipulation methods (§2.2) on the SafeEdit [74] dataset with its safety scorer... We use the Holistic Bias suite [66] and hate speech evaluator [64] to measure the sentiment of the response... |
| Hardware Specification | Yes | The experiments are conducted on a workstation of 8 NVIDIA A40 GPUs. |
| Software Dependencies | No | GPT-4-0125 is used for dictionary construction and concept partition. ... After retrieving the relevant knowledge (with the contriever [27]) from Wikipedia for concept synthesis, we take the top-5 ranked facts to append the instruction of LLM. The FAISS-indexed [31] Wikipedia is a snapshot of the 21 million disjoint text blocks from Wikipedia until December 2018. |
| Experiment Setup | Yes | Each response of the target LLM is set at a maximum of 512 tokens. Activation vectors are extracted from the last-29th to the last-11th layer (totaling 19 layers) of the target LLM's decoder layers. ... We set the scalar of the representation reading for concept vectors to 3.0. ... When solving the optimization problem for decomposition in §3.3, we set τ = 0.95 and α = 0.05 following the observations in [82]. |
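
The Pseudocode and Experiment Setup rows reference PaCE's Algorithm 1 (Overcomplete Oblique Projection) and its decomposition hyperparameters (τ = 0.95, α = 0.05). The snippet below is a minimal sketch of that decomposition step, not the authors' released implementation: it sparse-codes an activation vector over an overcomplete concept dictionary and subtracts the components attributed to undesirable concepts. The dictionary, the undesired-index set, the solver choice (scikit-learn's ElasticNet), and all variable names are illustrative assumptions.

```python
# Hedged sketch of PaCE-style concept removal: decompose an activation vector
# over an overcomplete concept dictionary, then subtract the undesired part.
# All names and the elastic-net solver are assumptions, not the paper's code.
import numpy as np
from sklearn.linear_model import ElasticNet

def remove_undesired_concepts(activation, concept_dict, undesired_idx, alpha=0.05):
    """Sparse-decompose `activation` over `concept_dict` and drop undesired atoms.

    activation    : (d,) activation vector read from a decoder layer
    concept_dict  : (d, n) overcomplete dictionary, one concept direction per column
    undesired_idx : indices of dictionary atoms to remove
    alpha         : sparsity weight (the paper reports alpha = 0.05)
    """
    # Solve a sparse regression: activation ~ concept_dict @ coeffs.
    solver = ElasticNet(alpha=alpha, fit_intercept=False, max_iter=5000)
    solver.fit(concept_dict, activation)
    coeffs = solver.coef_  # (n,) sparse concept coefficients

    # Reconstruct only the undesired components and remove them.
    undesired_part = concept_dict[:, undesired_idx] @ coeffs[undesired_idx]
    return activation - undesired_part

# Usage on random data, purely to show the shapes involved.
rng = np.random.default_rng(0)
D = rng.standard_normal((512, 200))   # hypothetical concept dictionary (d x n)
h = rng.standard_normal(512)          # hypothetical activation vector
h_clean = remove_undesired_concepts(h, D, undesired_idx=[3, 17, 42])
```

In the paper, the retained coefficients feed an oblique projection (Algorithm 1) that reconstructs the benign part of the activation; the plain subtraction above is a simplification for illustration only.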