Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Probing Hidden Knowledge Holes in Unlearned LLMs

Authors: Myeongseob Ko, Hoang Anh Just, Charles Fleming, Ming Jin, Ruoxi Jia

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation demonstrates significant hidden costs of unlearning: up to 98.7% of the test cases yield irrelevant or nonsensical responses from unlearned models, despite being answerable by the pretrained model. These findings necessitate rethinking the conventional approach to evaluating knowledge preservation in unlearning, moving beyond standard, static benchmarks. Utilizing this framework, we perform red-teaming studies on models subjected to various recent unlearning techniques. Our experiments reveal significant knowledge loss in both adjacent and latent knowledge areas, with numerous identified test cases where the original model provides high-quality responses, yet the unlearned model produces irrelevant, incomplete, or even unintelligible outputs (Figure 1).
Researcher Affiliation | Collaboration | Myeongseob Ko (Virginia Tech), Hoang Anh Just (Virginia Tech), Charles Fleming (Cisco Research), Ming Jin (Virginia Tech), Ruoxi Jia (Virginia Tech)
Pseudocode | No | The paper describes methods and processes in narrative text and structured lists, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with a code-like format.
Open Source Code | No | The authors will also release the code and datasets publicly.
Open Datasets | Yes | The first forgetting set contains 50 harmful samples from PKU-SafeRLHF [Ji et al., 2024], covering topics such as drug abuse, weapons, and banned substances. The second forgetting set comprises 200 samples drawn from a bio-corpus of WMDP-Bio [Li et al., 2024], which we use to construct D_adj(D_f). For LLMU, GAKL, and NPOGD, we use 817 samples from TruthfulQA as D_r, following the convention of [Yao et al., 2023]. We also sample 800 Wikitext [Merity et al., 2016] to construct D_adj(D_r) for RMU.
Dataset Splits | Yes | The first forgetting set contains 50 harmful samples from PKU-SafeRLHF [Ji et al., 2024]... The second forgetting set comprises 200 samples drawn from a bio-corpus of WMDP-Bio [Li et al., 2024]... For LLMU, GAKL, and NPOGD, we use 817 samples from TruthfulQA as D_r... After post-hoc filtering (Section 3.3), D_AP comprises 105 prompts for RMU and 161 for LLMU, GAKL, NPOGD. For D_RL (Section 3.2), we first collect 1,627, 1,938, 1,334, and 1,837 raw prompts for RMU, GAKL, NPOGD, LLMU respectively. Post-hoc filtering narrows them to 359 (RMU), 1,678 (GAKL), 1,119 (NPOGD), and 1,640 (LLMU). ...producing final latent sets of 75, 350, 300, and 350 test cases for RMU, GAKL, NPOGD, and LLMU, respectively.
Hardware Specification | Yes | All experiments were conducted using an NVIDIA H100 GPU.
Software Dependencies | No | The paper mentions models like LLAMA-2-7B-BASE, Zephyr-7B, Mistral-7B-Instruct-v0.1, GPT-4o mini, and the sentence-transformer model 'all-MiniLM-L6-v2', along with the LoRA technique, but does not specify version numbers for general software libraries or frameworks like Python, PyTorch, or CUDA.
Experiment Setup | Yes | LLMU: For the LLMU configuration, we set the learning rate to 5e-5 and used a batch size of 2. Both forget and retain weights are set to 1.0. We run 1000 iterations with a termination criterion of maximum loss threshold at 100. GAKL: The GAKL implementation maintains the same hyperparameters as LLMU, but without random labeling. NPOGD: For NPOGD, we utilize a learning rate of 1e-6 and conduct training for 10 epochs. Reinforcement Learning: We use consistent hyperparameters across all RL experiments as detailed in Table 5. We use 50 epochs with the number of episodes of 128 for training the policy network.
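
The prompt counts quoted under Dataset Splits can be turned into per-method retention rates, which makes it easier to see how differently post-hoc filtering treats RMU compared with the other methods. The following is a minimal sketch, not code from the paper: the names `COUNTS`, `retention_rates`, and `RATES` are hypothetical, and the step between the filtered and final sets is elided in the quoted text, so the second ratio is only a size comparison.

```python
# Prompt counts quoted in the Dataset Splits row for the latent-knowledge sets.
# Tuple layout (an assumption of this sketch):
#   (raw prompts, prompts after post-hoc filtering, final test cases)
COUNTS = {
    "RMU": (1627, 359, 75),
    "GAKL": (1938, 1678, 350),
    "NPOGD": (1334, 1119, 300),
    "LLMU": (1837, 1640, 350),
}


def retention_rates(counts):
    """Return {method: (filtered / raw, final / filtered)} for each method."""
    rates = {}
    for method, (raw, filtered, final) in counts.items():
        rates[method] = (filtered / raw, final / filtered)
    return rates


RATES = retention_rates(COUNTS)

if __name__ == "__main__":
    for method, (kept, selected) in RATES.items():
        # kept: share of raw prompts surviving post-hoc filtering
        # selected: size of the final set relative to the filtered set
        print(f"{method}: kept {kept:.1%} of raw, final/filtered {selected:.1%}")
```

Under these counts, filtering keeps roughly a fifth of the raw RMU prompts but well over four fifths for GAKL, NPOGD, and LLMU.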