Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
PaCE: Parsimonious Concept Engineering for Large Language Models
Authors: Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, Rene Vidal
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities. |
| Researcher Affiliation | Academia | University of Pennsylvania; Johns Hopkins University |
| Pseudocode | Yes | Algorithm 1: Overcomplete Oblique Projection (ObliqProj) |
| Open Source Code | No | Our collected dataset for concept representations is available at https://github.com/peterljq/Parsimonious-Concept-Engineering. ... We open-source the PaCE-1M dataset to facilitate future research and practical applications of LLM alignment, and will release the source code soon. |
| Open Datasets | Yes | Our collected dataset for concept representations is available at https://github.com/peterljq/Parsimonious-Concept-Engineering. |
| Dataset Splits | No | We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising... We compare our method in defending maliciousness against activation manipulation methods (§2.2) on the SafeEdit [74] dataset with its safety scorer... We use the HolisticBias suite [66] and hate speech evaluator [64] to measure the sentiment of the response... |
| Hardware Specification | Yes | The experiments are conducted on a workstation of 8 NVIDIA A40 GPUs. |
| Software Dependencies | No | GPT-4-0125 is used for dictionary construction and concept partition. ... After retrieving the relevant knowledge (with the contriever [27]) from Wikipedia for concept synthesis, we take the top-5 ranked facts to append the instruction of LLM. The FAISS-indexed [31] Wikipedia is a snapshot of the 21 million disjoint text blocks from Wikipedia until December 2018. |
| Experiment Setup | Yes | Each response of the target LLM is set at a maximum of 512 tokens. Activation vectors are extracted from the last-29th to the last-11th layer (totaling 19 layers) of the target LLM's decoder layers. ... We set the scalar of the representation reading for concept vectors to 3.0. ... When solving the optimization problem for decomposition in §3.3, we set τ = 0.95 and α = 0.05 following the observations in [82]. |
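
The layer window quoted in the Experiment Setup row ("last-29th to the last-11th layer, totaling 19 layers") can be checked with a short sketch. This is only an illustration of the index arithmetic, not the authors' code: the helper name `select_layers` is hypothetical, and the depth of 32 decoder layers is an assumed value that the table does not state.

```python
def select_layers(num_layers: int, first_from_end: int = 29, last_from_end: int = 11) -> list[int]:
    """Zero-based indices from the last-`first_from_end`-th layer to the
    last-`last_from_end`-th layer, inclusive (the last layer is "last-1st")."""
    if not (1 <= last_from_end <= first_from_end <= num_layers):
        raise ValueError("layer window does not fit in the model")
    return list(range(num_layers - first_from_end, num_layers - last_from_end + 1))


# Assumption: a decoder with 32 layers; the table does not give the model depth.
layers = select_layers(num_layers=32)
print(len(layers), layers[0], layers[-1])  # prints: 19 3 21
```

The window always contains 29 - 11 + 1 = 19 layers regardless of depth, which matches the count quoted in the row; only the absolute indices depend on the assumed number of layers.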