Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Machine Unlearning via Task Simplex Arithmetic

Authors: Junhao Dong, Hao Zhu, Yifei Zhang, Xinghua Qu, Yew Soon Ong, Piotr Koniusz

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments and analyses across diverse datasets and scenarios demonstrate the efficacy of our method.
Researcher Affiliation	Collaboration	Junhao Dong (1, 2), Hao Zhu (3), Yifei Zhang (1), Xinghua Qu (3), Yew-Soon Ong (1, 2), and Piotr Koniusz (4, 5, 6). Affiliations: (1) Nanyang Technological University, (2) CFAR, IHPC, A*STAR, (3) Bytedance, (4) Data61 CSIRO, (5) University of New South Wales, (6) Australian National University
Pseudocode	No	The paper describes methods and derivations in mathematical formulations and text, but does not present any structured pseudocode or algorithm blocks.
Open Source Code	No	Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No]
Open Datasets	Yes	Datasets. Following prior works [29, 50], we adopt the identical experimental setup. The unlearning evaluation of CLIP is performed on eight datasets designated as the forget set. To assess performance of retained knowledge, we use ImageNet [7] as the retain set. Further details are in Appendix A.1. The eight datasets used as forget sets are categorized by their recognition scenario as follows: 1. Fine-grained classification: Stanford Cars [35]. 2. Texture recognition: Describable Textures Dataset (DTD) [5]. 3. Remote sensing and aerial imagery: EuroSAT [26], and Remote Sensing Image Scene Classification (RESISC) [3]. 4. Traffic and digit recognition: German Traffic Sign Recognition Benchmark (GTSRB) [55], MNIST [37], and Street View House Numbers (SVHN) [47]. 5. Scene recognition: SUN397 [58]. All the datasets used in the paper are publicly available.
Dataset Splits	No	For each dataset (eight datasets in total) in the forget set, we obtain the task vector by fine-tuning the CLIP model on its training split. Unlearning evaluation is then performed using the ImageNet validation set, as its test labels are not publicly available.
Hardware Specification	Yes	All the experiments in this paper are conducted based on eight NVIDIA H100 GPUs with 80GB of memory.
Software Dependencies	No	Specifically, we adopt the AdamW optimizer [44] with a peak learning rate of 1 × 10⁻⁵, momentum (0.9, 0.999) along with a cosine annealing scheduler and a weight decay of 0.1.
Experiment Setup	Yes	Implementation Details. Unless otherwise stated, we adopt the standard CLIP fine-tuning protocol for generating the task vectors on each forget dataset, in line with prior task arithmetic studies [29, 50]. Specifically, we adopt the AdamW optimizer [44] with a peak learning rate of 1 × 10⁻⁵, momentum (0.9, 0.999) along with a cosine annealing scheduler and a weight decay of 0.1. During fine-tuning, the CLIP text encoder is frozen to retain the integrity of the pre-trained textual representations for the classification head via zero-shot prompt embeddings. To ensure diversity of task vectors, we generate a pool of 30 fine-tuned CLIP models by varying the data augmentation configurations, e.g., using different hyper-parameters of RandAugment [6]. All VLM unlearning experiments are conducted across diverse CLIP backbones [52], including ViT-Base/32, ViT-Base/16, and ViT-Large/14. For each unlearning scenario, the task vector negation coefficient λ is selected on a small held-out subset of the training data. Unless otherwise specified, the task simplex is constructed using 30 vertices, and the distillation variance penalty is set to β = 2.0.
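
The experiment setup above refers to task vectors and a negation coefficient λ. As a rough illustration only, the following is a minimal sketch of plain task-vector negation as used in prior task arithmetic work; it is not the paper's task simplex method, and it does not model the 30-vertex simplex or the variance penalty β. The names `task_vector`, `negate_task`, and the toy weight dictionaries are hypothetical and chosen for this example.

```python
from typing import Dict, List

# Toy stand-in for a model state: parameter name -> flat list of weights.
# (Assumption for illustration; real CLIP weights would be tensors.)
Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Compute tau = theta_finetuned - theta_pretrained for each parameter."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def negate_task(pretrained: Weights, tau: Weights, lam: float) -> Weights:
    """Compute theta_unlearned = theta_pretrained - lam * tau."""
    return {
        name: [p - lam * t for p, t in zip(pretrained[name], tau[name])]
        for name in pretrained
    }


# Hypothetical toy weights; values are arbitrary.
pretrained = {"w": [1.0, 2.0], "b": [0.5]}
finetuned = {"w": [1.5, 1.0], "b": [0.5]}

tau = task_vector(pretrained, finetuned)
unlearned = negate_task(pretrained, tau, lam=1.0)
```

In this sketch, λ scales how far the weights move away from the fine-tuned solution; the setup above states that λ is selected on a small held-out subset of the training data.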