Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Probing Hidden Knowledge Holes in Unlearned LLMs
Authors: Myeongseob Ko, Hoang Anh Just, Charles Fleming, Ming Jin, Ruoxi Jia
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation demonstrates significant hidden costs of unlearning: up to 98.7% of the test cases yield irrelevant or nonsensical responses from unlearned models, despite being answerable by the pretrained model. These findings necessitate rethinking the conventional approach to evaluating knowledge preservation in unlearning, moving beyond standard, static benchmarks. Utilizing this framework, we perform red-teaming studies on models subjected to various recent unlearning techniques. Our experiments reveal significant knowledge loss in both adjacent and latent knowledge areas, with numerous identified test cases where the original model provides high-quality responses, yet the unlearned model produces irrelevant, incomplete, or even unintelligible outputs (Figure 1). |
| Researcher Affiliation | Collaboration | Myeongseob Ko (Virginia Tech), Hoang Anh Just (Virginia Tech), Charles Fleming (Cisco Research), Ming Jin (Virginia Tech), Ruoxi Jia (Virginia Tech) |
| Pseudocode | No | The paper describes methods and processes in narrative text and structured lists, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with a code-like format. |
| Open Source Code | No | The authors will also release the code and datasets publicly. |
| Open Datasets | Yes | The first forgetting set contains 50 harmful samples from PKU-SafeRLHF [Ji et al., 2024], covering topics such as drug abuse, weapons, and banned substances. The second forgetting set comprises 200 samples drawn from a bio-corpus of WMDP-Bio [Li et al., 2024], which we use to construct Dadj(Df). For LLMU, GAKL, and NPOGD, we use 817 samples from TruthfulQA as Dr, following the convention of [Yao et al., 2023]. We also sample 800 Wikitext [Merity et al., 2016] to construct Dadj(Dr) for RMU. |
| Dataset Splits | Yes | The first forgetting set contains 50 harmful samples from PKU-SafeRLHF [Ji et al., 2024]... The second forgetting set comprises 200 samples drawn from a bio-corpus of WMDP-Bio [Li et al., 2024]... For LLMU, GAKL, and NPOGD, we use 817 samples from TruthfulQA as Dr... After post-hoc filtering (Section 3.3), DAP comprises 105 prompts for RMU and 161 for LLMU, GAKL, NPOGD. For DRL (Section 3.2), we first collect 1,627, 1,938, 1,334, and 1,837 raw prompts for RMU, GAKL, NPOGD, LLMU respectively. Post-hoc filtering narrows them to 359 (RMU), 1,678 (GAKL), 1,119 (NPOGD), and 1,640 (LLMU). ...producing final latent sets of 75, 350, 300, and 350 test cases for RMU, GAKL, NPOGD, and LLMU, respectively. |
| Hardware Specification | Yes | All experiments were conducted using an NVIDIA H100 GPU. |
| Software Dependencies | No | The paper mentions models like LLAMA-2-7B-BASE, Zephyr-7B, Mistral-7B-Instruct-v0.1, GPT-4o mini, and the sentence-transformer model 'all-MiniLM-L6-v2', along with the LoRA technique, but does not specify version numbers for general software libraries or frameworks like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | LLMU: For the LLMU configuration, we set the learning rate to 5e-5 and used a batch size of 2. Both forget and retain weights are set to 1.0. We run 1000 iterations with a termination criterion of maximum loss threshold at 100. GAKL: The GAKL implementation maintains the same hyperparameters as LLMU, but without random labeling. NPOGD: For NPOGD, we utilize a learning rate of 1e-6 and conduct training for 10 epochs. Reinforcement Learning: We use consistent hyperparameters across all RL experiments as detailed in Table 5. We use 50 epochs with the number of episodes of 128 for training the policy network. |
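
The hyperparameters quoted in the Experiment Setup row can be gathered into one place for anyone attempting a rerun. The following is a minimal sketch, not the authors' code: the class name `UnlearnConfig` and every field name are hypothetical, values the row does not state are left as `None`, and `random_labeling=True` for LLMU is only inferred from the remark that GAKL runs "without random labeling".

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class UnlearnConfig:
    """Hypothetical container; field names are illustrative, not from the paper."""
    method: str
    learning_rate: float
    batch_size: Optional[int] = None
    forget_weight: Optional[float] = None
    retain_weight: Optional[float] = None
    max_iterations: Optional[int] = None
    max_loss_threshold: Optional[float] = None
    epochs: Optional[int] = None
    random_labeling: Optional[bool] = None  # None means "not reported"


# LLMU: values quoted in the Experiment Setup row.
LLMU = UnlearnConfig(
    method="LLMU",
    learning_rate=5e-5,
    batch_size=2,
    forget_weight=1.0,
    retain_weight=1.0,
    max_iterations=1000,
    max_loss_threshold=100.0,
    random_labeling=True,  # assumption inferred from the GAKL description
)

# GAKL: same hyperparameters as LLMU, but without random labeling.
GAKL = replace(LLMU, method="GAKL", random_labeling=False)

# NPOGD: only the learning rate and epoch count are reported.
NPOGD = UnlearnConfig(method="NPOGD", learning_rate=1e-6, epochs=10)

# RL policy training: the remaining settings live in the paper's Table 5.
RL_POLICY = {"epochs": 50, "episodes": 128}

if __name__ == "__main__":
    for cfg in (LLMU, GAKL, NPOGD):
        print(cfg)
    print(RL_POLICY)
```

Fields left as `None` mark settings the quoted text does not report, which is a quick way to see what a reproduction would still need to recover from the paper itself.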
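
The Dataset Splits row reports, for each unlearning method, how many raw prompts were collected, how many survived post-hoc filtering, and how many ended up in the final latent set. As a rough illustration of how uneven the filtering is across methods, the sketch below computes the fraction retained at each stage. The counts are copied from that row; the dictionary and function names are assumptions made for this example.

```python
# Prompt counts per unlearning method, copied from the Dataset Splits row.
RAW = {"RMU": 1627, "GAKL": 1938, "NPOGD": 1334, "LLMU": 1837}
FILTERED = {"RMU": 359, "GAKL": 1678, "NPOGD": 1119, "LLMU": 1640}
FINAL = {"RMU": 75, "GAKL": 350, "NPOGD": 300, "LLMU": 350}


def retention(before: dict, after: dict) -> dict:
    """Fraction of items kept per method between two pipeline stages."""
    return {method: after[method] / before[method] for method in before}


filter_rate = retention(RAW, FILTERED)    # raw prompts -> post-hoc filtered
final_rate = retention(FILTERED, FINAL)   # filtered -> final latent set

if __name__ == "__main__":
    for method in RAW:
        print(
            f"{method:6s} kept after filtering: {filter_rate[method]:.1%}, "
            f"kept in final set: {final_rate[method]:.1%}"
        )
```

With these counts, RMU keeps roughly a fifth of its raw prompts after filtering, while the other three methods keep well over four fifths.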