Towards Test-Time Refusals via Concept Negation

Authors: Peiran Dong, Song Guo, Junxiao Wang, Bingjie Wang, Jiewei Zhang, Ziming Liu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation on multiple benchmarks shows that PROTORE outperforms state-of-the-art methods under various settings, in terms of the effectiveness of purification and the fidelity of generative images. ... Through comprehensive evaluations on multiple benchmarks, we demonstrate that PROTORE surpasses existing methods in terms of purification effectiveness and the fidelity of generated images across various settings. ... In this section, we empirically evaluate the effectiveness of our proposed PROTORE.
Researcher Affiliation | Academia | ¹Hong Kong Polytechnic University, ²Hong Kong University of Science and Technology, ³King Abdullah University of Science and Technology & SDAIA-KAUST AI. {peiran.dong,bingjie.wang,jiewei.zhang,ziming.liu}@connect.polyu.hk, songguo@cse.ust.hk, junxiao.wang@kaust.edu.sa
Pseudocode | No | Our proposed algorithm is formally presented in the Appendix, which consists primarily of two steps. Although the paper states that an algorithm appears in the appendix, the provided text does not include the appendix, so no pseudocode block could be verified.
Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository for the described methodology.
Open Datasets | Yes | ImageNet subset. We first investigate the performance of single-concept refusal through numerical results. Specifically, we choose one class from ImageNet as the negation target. ... Following the same setting in ESD [22], we select the Imagenette subset that consists of ten readily recognizable classes. ... The Inappropriate Image Prompts (I2P) benchmark dataset [24] contains 4703 toxic prompts assigned to at least one of the following categories: hate, harassment, violence, self-harm, sexual, shocking, illegal activity. ... To this end, we follow prior work [24, 22] on generative text-to-image models and evaluate the COCO FID-30k scores of SD and the three additional methods, as presented in Table 3.
Dataset Splits | No | The paper implicitly relies on standard splits through terms such as "ImageNet classifier" and "COCO 30k dataset," but the main text gives no explicit split percentages, sample counts, or splitting methodology for its experiments.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper refers to pre-trained models such as CLIP and a ResNet-50 ImageNet classifier, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions, or specific library versions).
Experiment Setup | Yes | The refinement step size σ is set to 1.0 in our experiments unless specified otherwise. ... We employed inference guidance of 7.5 in our experiments.
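For context, the "inference guidance of 7.5" quoted above is the standard classifier-free guidance scale used when sampling from text-to-image diffusion models such as Stable Diffusion. A minimal sketch of how that scale enters the sampling step is below; the function and tensors are illustrative, not the authors' code:

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale=7.5):
    """Combine unconditional and conditional noise predictions.

    The guided prediction pushes the sample toward the text condition
    by `scale` times the conditional-unconditional difference.
    scale=7.5 matches the inference guidance reported in the paper's setup.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy example: random stand-ins for the two noise predictions.
rng = np.random.default_rng(0)
eps_u = rng.standard_normal(4)
eps_c = rng.standard_normal(4)
guided = classifier_free_guidance(eps_u, eps_c)
```

With scale = 1.0 the guided prediction reduces to the purely conditional one; values well above 1.0 (such as 7.5) trade sample diversity for stronger prompt adherence.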