Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Position: Fundamental Limitations of LLM Censorship Necessitate New Approaches
Authors: David Glukhov, Ilia Shumailov, Yarin Gal, Nicolas Papernot, Vardan Papyan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We present fundamental limitations of verifying the semantic properties of LLM outputs and identifying compositional threats, illustrating inherent challenges of current approaches to censoring LLM outputs. Specifically, we demonstrate that semantic censorship can be perceived as an undecidable problem, and semantic properties of LLM outputs can become impossible to verify when the LLM is capable of providing "encrypted" outputs. We further show challenges of censorship can extend beyond just semantic censorship, as attackers can reconstruct impermissible outputs from a collection of permissible ones. Consequently, we call for a reevaluation of the problem of censorship and its goals, stressing the need for new definitions and approaches to censorship. In addition, we provide an initial attempt toward achieving this goal through syntactic censorship, drawing from a security perspective to design censorship methods that can provide guarantees. |
| Researcher Affiliation | Academia | ¹University of Toronto & Vector Institute; ²University of Oxford. Correspondence to: David Glukhov <EMAIL>. |
| Pseudocode | No | The paper describes algorithms in prose but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not mention providing open-source code for the methodology it describes. |
| Open Datasets | No | The paper is theoretical and does not conduct experiments involving training on datasets that it would need to make publicly available. |
| Dataset Splits | No | The paper is theoretical and does not involve dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper does not specify any hardware used for its theoretical analysis or demonstrations. |
| Software Dependencies | No | The paper discusses LLMs like GPT-4-turbo but does not list specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with hyperparameters or training configurations. |
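The compositional threat summarized in the Research Type row (attackers reconstructing impermissible outputs from a collection of permissible ones) can be illustrated with a toy sketch. This example is not from the paper; the keyword filter and the fragment-splitting scheme are hypothetical stand-ins for a semantic censor and an attacker's decomposition strategy.

```python
# Toy illustration (hypothetical, not the paper's method) of the
# "compositional threat": each fragment individually passes a naive
# keyword censor, yet the fragments recombine into the blocked string.

BLOCKED_KEYWORDS = {"password"}


def naive_censor_permits(text: str) -> bool:
    """Return True if the text passes a simple keyword-based filter."""
    return not any(kw in text.lower() for kw in BLOCKED_KEYWORDS)


secret = "password"
# Split the impermissible string into short fragments, each of which
# contains no blocked keyword and therefore passes the filter.
fragments = [secret[i:i + 3] for i in range(0, len(secret), 3)]

every_piece_permitted = all(naive_censor_permits(f) for f in fragments)
recombined_permitted = naive_censor_permits("".join(fragments))

print(every_piece_permitted)   # each fragment passes the censor
print(recombined_permitted)    # the recombined output does not
```

The sketch shows why output-by-output filtering gives no guarantee against an adversary who can aggregate multiple permissible responses, which is the motivation the paper gives for reevaluating censorship definitions.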