Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

H-SPLID: HSIC-based Saliency Preserving Latent Information Decomposition

Authors: Lukas Miklautz, Chengzhi Shi, Andrii Shkabrii, Theodoros Thirimachos Davarakis, Prudence Lam, Claudia Plant, Jennifer Dy, Stratis Ioannidis

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations on image classification tasks show that models trained with H-SPLID primarily rely on salient input components, as indicated by reduced sensitivity to perturbations affecting non-salient features, such as image backgrounds. 4 Experiments: In this section, we describe our datasets, comparison methods, experiment setup, and performance metrics. Additional runtime experiments, implementation details, and hyperparameter tuning and configuration protocols are provided in Appendix F. Our code is publicly available at https://github.com/neu-spiral/H-SPLID. Datasets and Encoder Models: We evaluate H-SPLID on synthetic and natural image benchmarks, spanning five datasets and three architectures (Table 1). We create a synthetic Concatenated-MNIST (C-MNIST) dataset (see Fig. 1) by concatenating two MNIST digits with the left digit as the class label. We use LeNet-3 [27, 26] as an encoder.
Researcher Affiliation	Academia	(1) Department of Machine Learning and Systems Biology, Max Planck Institute of Biochemistry, Martinsried, Germany; (2) Northeastern University, Boston, MA, USA; (3) Faculty of Computer Science, University of Vienna, Vienna, Austria; (4) Doctoral School Computer Science, University of Vienna, Vienna, Austria; (5) Research Network Data Science, University of Vienna, Vienna, Austria
Pseudocode	Yes	B H-SPLID Pseudocode: Algorithm 1 contains pseudocode for H-SPLID, i.e., the alternating optimization algorithm presented in Section 3.4 to solve Problem (6).
Open Source Code	Yes	Our code is publicly available at https://github.com/neu-spiral/H-SPLID.
Open Datasets	Yes	We evaluate H-SPLID on synthetic and natural image benchmarks, spanning five datasets and three architectures (Table 1). We create a synthetic Concatenated-MNIST (C-MNIST) dataset (see Fig. 1) by concatenating two MNIST digits with the left digit as the class label. We use LeNet-3 [27, 26] as an encoder. We construct a four-class subset of COCO [28] (bear, elephant, giraffe, zebra), coupled with a ResNet-18 [20] encoder. ISIC-2017 is a medical imaging dataset [9]. ImageNet-9 (IN-9) [58] encompasses 368 classes from ImageNet-1K instantiated in three variants: Original images, a MixedRand variant in which object foregrounds are put onto random-class backgrounds, and an Only-FG variant with backgrounds entirely removed (see Figure 4). CounterAnimal (CA) [55] splits iNaturalist wildlife photos into a Common set (exhibiting typical backgrounds) and a Counter set (featuring atypical yet plausible backgrounds) (see Fig. 3a). We use ResNet-50 as an encoder for ISIC-2017 and ImageNet-derived datasets. Table 7: License and source compliance for each dataset (Dataset | URL | License): ImageNet-1K [11] | image-net.org | ImageNet Terms; ImageNet-9 [58] | GitHub | Inherits ImageNet Terms; COCO [28] | cocodataset.org | CC BY 4.0 (annotations) / Flickr TOU (images); CounterAnimal [55] | counteranimal.github.io | Inherits iNaturalist Terms; ISIC-2017 [9] | ISIC Challenge | CC-0; C-MNIST | (our codebase) | Inherits MNIST Terms (CC BY-SA 3.0).
Dataset Splits	Yes	In all datasets, we employ an 80-20 validation split for tuning, and use held-out test sets for final evaluation. F.1 Datasets: COCO is a segmentation dataset consisting of labeled images of various species of animals (see Figure 3b). For our experiments, we utilize a subset of the dataset composed of images drawn from one of four labels. The four species were carefully selected to ensure the largest possible dataset containing images without overlapping labels. Since we use the dataset for image classification, each sample should belong to one class and thus include animals from one and only one of the four selected classes. During pre-processing, the dataset is resized to 224×224 pixels. Finally, segmentation information is used to construct 224×224 masks, where the 0 entries denote the pixels occupied by the animal (salient object) in the original image. These masks specify the portion of the image shielded from adversarial perturbations. The splits are created from the public training data of COCO by splitting them into train (4509 samples), validation (1127 samples) and test (1411 samples). C-MNIST is a synthetically constructed variant of the original MNIST dataset [27]. To generate it, we first load the standard 28×28 single-channel digit images. Subsequently, each sample is randomly paired with another digit using a fixed seed for reproducibility. The two images are concatenated along the width to form a 56×28 composite, then symmetrically zero-padded to a uniform 64×64 resolution. During training and evaluation, we treat only the left-hand digit as the classification target, ensuring each composite image belongs to exactly one class. We use 80% of the original train split of MNIST as training data and 20% as validation data. For testing we use the test set of MNIST, where we also create image pairs as described above.
F.4 Hyperparameters: We divide our hyperparameters into two groups: those shared by all models, and those tuned or adapted per method and dataset. Shared parameters. All ImageNet-1K ResNet-50 and COCO ResNet-18 experiments use the Adam optimizer [24] with β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸. We perform an initial grid search on a vanilla ResNet-18, sweeping the learning rate over {10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶} in logarithmic steps, and select LR = 5 × 10⁻⁴ for all subsequent runs. The batch size is set to 256, and weight decay is 0 by default (except in weight-decay experiments). For the C-MNIST experiments we use LeNet-3 [26] with a 1024-dimensional embedding space (as in Figure 1); we use a learning rate of LR = 1 × 10⁻⁵ and train for 50 epochs from random initialization.
Hardware Specification	Yes	F.3 Software and Hardware Setup: We built our pipeline in Python, leveraging the PyTorch [39] library. To conduct our experiments, we use two identical internal servers running Ubuntu 22.04.3 LTS (Jammy Jellyfish) on a 5.15.0-84 x86_64 kernel. Each server is equipped with two Intel Xeon Gold 6326 processors (16 cores each, hyper-threaded for a total of 64 logical CPUs), 512 GiB of RAM, and a single NVIDIA A100 80 GB GPU. For the ablation studies and experiments conducted on the ISIC-2017 dataset, we additionally made use of EuroHPC compute resources, including MareNostrum (BSC, Spain), MeluXina (LuxProvide, Luxembourg), Deucalion (MACC, Portugal), and Discoverer (Sofia Tech, Bulgaria).
Software Dependencies	No	F.3 Software and Hardware Setup: We built our pipeline in Python, leveraging the PyTorch [39] library. To conduct our experiments, we use two identical internal servers running Ubuntu 22.04.3 LTS (Jammy Jellyfish) on a 5.15.0-84 x86_64 kernel. Each server is equipped with two Intel Xeon Gold 6326 processors (16 cores each, hyper-threaded for a total of 64 logical CPUs), 512 GiB of RAM, and a single NVIDIA A100 80 GB GPU. For the ablation studies and experiments conducted on the ISIC-2017 dataset, we additionally made use of EuroHPC compute resources, including MareNostrum (BSC, Spain), MeluXina (LuxProvide, Luxembourg), Deucalion (MACC, Portugal), and Discoverer (Sofia Tech, Bulgaria). F.2 Implementation: The HBaR code was adapted from the original codebase, which is publicly available on GitHub (under MIT License). For weight decay, we reuse the PyTorch [39] implementation and pass it directly to the optimizer. We re-implemented the following regularization methods: Group Lasso Weights, Group Lasso Activations, L1 Sparse Activations, and L1 Sparse Weights. For both the Projected Gradient Descent (PGD) [34] and AutoAttack (AA) [10] adversaries, we utilize our version of the TorchAttacks [23] library, which is adapted for masked attacks.
Experiment Setup	Yes	F.4 Hyperparameters: We divide our hyperparameters into two groups: those shared by all models, and those tuned or adapted per method and dataset. Shared parameters. All ImageNet-1K ResNet-50 and COCO ResNet-18 experiments use the Adam optimizer [24] with β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸. We perform an initial grid search on a vanilla ResNet-18, sweeping the learning rate over {10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶} in logarithmic steps, and select LR = 5 × 10⁻⁴ for all subsequent runs. The batch size is set to 256, and weight decay is 0 by default (except in weight-decay experiments). For the C-MNIST experiments we use LeNet-3 [26] with a 1024-dimensional embedding space (as in Figure 1); we use a learning rate of LR = 1 × 10⁻⁵ and train for 50 epochs from random initialization. For both COCO and ImageNet-1K we use TorchVision [35] augmentations. Training augmentations include (1) RandomResizedCrop to a 224 × 224 patch (scaling and cropping with a random area and aspect ratio), (2) ColorJitter applied with probability p = 0.8 (brightness 40%, contrast 40%, saturation 20%, hue 10%), (3) RandomGrayscale with p = 0.2, (4) RandomHorizontalFlip with p = 0.5, (5) RandomSolarize with threshold 0.5 and p = 0.2, followed by (6) ToTensor and (7) Normalization using per-channel means and standard deviations (ImageNet defaults [0.485, 0.456, 0.406], [0.229, 0.224, 0.225] or COCO-computed statistics). At test time, inputs are first resized so that the shorter side is 256 px, then center-cropped to 224 × 224, and finally passed through ToTensor and the same Normalization. Tuning strategy and per-method tuning ranges. We employ identical hyperparameter tuning strategies for H-SPLID and all comparison methods. 30% of the training corpus is randomly sampled for ImageNet, while the complete training set is used for all other datasets. In either case, 20% of the samples are used to constitute the validation set.
Hyperparameters are optimized via grid search by selecting the model configuration exhibiting the highest robust validation accuracy at the end of training, in which robustness is measured with respect to Projected Gradient Descent (PGD) [34] attacks applied to the entire image, so no knowledge of salient or non-salient regions is used. We use dataset-specific perturbation budgets of ϵ = 1/255 for ImageNet and ϵ = 2/255 for COCO. These values were chosen to be strong enough to select for more robust models, while at the same time not being so strong as to induce model collapse to random accuracy, so we can use the metric for model selection. Based on this selection criterion we trained each method on COCO three times and selected the run with highest robust accuracy for validation. For ImageNet we only trained one run. Importantly, no information pertaining to the salient or non-salient regions is leveraged during the tuning. Table 8 summarizes the grid ranges we search for each method on ImageNet-1K and COCO. H-SPLID selected settings. After tuning as above, the final hyperparameters chosen for H-SPLID on each dataset are listed in Table 9. In all H-SPLID runs we also set λce = 10 to balance the cross-entropy scale. Due to the scale of ImageNet-1K, we introduce two scheduling parameters: (i) βinit_fraction, the fraction of training data used to compute the initial mask values (20% for ImageNet-1K, 100% for COCO), and (ii) βupdate_fraction, which determines the amount of training data that must be processed before updating the masks (5% of the dataset for ImageNet-1K, corresponding to multiple updates per epoch; 100% for COCO, corresponding to one update per epoch).
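
Illustrative sketch (not taken from the paper's codebase): the C-MNIST construction quoted in the Dataset Splits row (pair each 28×28 digit with a randomly chosen partner under a fixed seed, concatenate along the width, zero-pad symmetrically to 64×64, keep the left digit's label) might look roughly like the following in NumPy. The function name `make_cmnist`, its parameters, and the choice to sample partners uniformly with replacement are assumptions made for illustration; the excerpt does not specify how partners are drawn.

```python
import numpy as np


def make_cmnist(images, labels, seed=0, out_size=64):
    """Build left-labelled digit pairs from MNIST-style arrays.

    images: (N, 28, 28) array; labels: (N,) array.
    Returns (N, out_size, out_size) composites and the left-digit labels.
    """
    rng = np.random.default_rng(seed)  # fixed seed, so the pairing is repeatable
    # Assumption: partners are drawn uniformly with replacement.
    partner = rng.integers(0, len(images), size=len(images))
    # Concatenate along the width axis: two (N, 28, 28) stacks -> (N, 28, 56).
    pairs = np.concatenate([images, images[partner]], axis=2)
    # Symmetric zero padding up to out_size x out_size.
    pad_h = (out_size - pairs.shape[1]) // 2  # 18 rows on each side for 28-row inputs
    pad_w = (out_size - pairs.shape[2]) // 2  # 4 columns on each side for 56-column pairs
    padded = np.pad(pairs, ((0, 0), (pad_h, pad_h), (pad_w, pad_w)))
    # Only the left digit defines the class label.
    return padded, labels.copy()


if __name__ == "__main__":
    demo_images = np.ones((5, 28, 28), dtype=np.uint8)  # stand-in for MNIST digits
    demo_labels = np.arange(5)
    x, y = make_cmnist(demo_images, demo_labels)
    print(x.shape, y.tolist())
```

Under these assumptions the left digit always occupies rows 18 to 45 and columns 4 to 31 of the composite, so the salient region is fixed and the right-hand digit acts purely as a distractor.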