Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization
Authors: Chris Kolb, Laetitia Frost, Bernd Bischl, David Rügamer
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our theory across vision, language, and tabular tasks, where D-Gating consistently delivers strong performance-sparsity tradeoffs and outperforms both direct optimization of structured penalties and conventional pruning baselines. |
| Researcher Affiliation | Academia | Chris Kolb1,2 , Laetitia Frost1, Bernd Bischl1,2, and David Rügamer1,2 1Department of Statistics, LMU Munich 2Munich Center for Machine Learning (MCML) Author correspondence to EMAIL |
| Pseudocode | Yes | A Algorithm [...] In Algorithm 1, we provide the algorithm for sparse training using the proposed D-Gating method. |
| Open Source Code | No | All used datasets are publicly available and the source code for the experiments and baselines will be submitted in the supplementary materials. We aim to collect the code in a publicly accessible Git repository as soon as possible. |
| Open Datasets | Yes | We apply D-Gating with $D \in \{2, 3, 4\}$ to a LeNet-300-100 at the neuron level and train the model on the MNIST dataset using SGD. [...] We run experiments on CIFAR-10, CIFAR-100, and SVHN, using a VGG-16 [51] and ResNet-18 model [19]. [...] We test our approach on the Wine data set [7] using a range of λ values for a single tree-layer as suggested by [47]. [...] Table 2: Summary of datasets used in feature selection experiments (training samples / test samples / classes / input features). ISOLET [9]: 6,328 / 1,559 / 26 / 617; COIL20 [44]: 1,152 / 288 / 20 / 400; ACTIVITY [2]: 4,252 / 1,492 / 6 / 561; MNIST [34]: 60,000 / 10,000 / 10 / 784; F-MNIST [64]: 60,000 / 10,000 / 10 / 784; MADELON [17]: 2,080 / 520 / 2 / 500. |
| Dataset Splits | Yes | For this, we simulate n = 200 training (and 2000 test) samples with p = 200 features grouped into 40 groups of 5 features each. [...] For datasets which do not explicitly provide a test set, we randomly assign 20% of the samples to the test set. [...] We conduct experiments on three standard image classification benchmark tasks described in Table 3. We use the train/test split provided by the datasets. [...] We set aside 10% of the training data for validation purposes. [...] we perform 5-fold cross-validation (CV) to obtain mean performance and sparsity metrics, together with standard errors. |
| Hardware Specification | Yes | All experiments were conducted either on a single NVIDIA RTX A6000 or A4000 GPU with 48GB and 16GB of memory, respectively. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) used for the experiments. While Adam [26] is mentioned, it refers to a method, not a software library with a version number. |
| Experiment Setup | Yes | We train the D-Gated models and direct L2,1 penalization using full-batch gradient descent for 1500 iterations using a cosine learning rate schedule with initial learning rate $5 \times 10^{-2}$ and 0.9 momentum. [...] we use a fully-connected LeNet-300-100 architecture [35] with two hidden layers (300 and 100 neurons) and ReLU activation functions and Kaiming Normal initialization. [...] We train the network on all classification tasks and methods using SGD for 100 epochs with a batch size of 256 and cosine schedule with an initial learning rate of 0.1. [...] Table 4: Training hyperparameters for different architectures and image classification datasets. The learning rates correspond to D-Gating with D = {2, 3, 4} (set) and the comparison methods (second value). The comparison methods use standard Kaiming initialization. VGG-16, CIFAR-10: 200 epochs, batch size 128, SGD, momentum 0.9, Kaiming Normal init., LR {0.3, 0.4, 0.4}, 0.1, cosine schedule; VGG-16, SVHN: 200 epochs, batch size 128, SGD, momentum 0.9, Kaiming Normal init., LR {2e-3, 3e-3, 3e-3}, 2e-3, cosine schedule; ResNet-18, CIFAR-100: 200 epochs, batch size 128, SGD, momentum 0.9, Kaiming Normal init., LR {0.2, 0.3, 0.4}, 0.2, cosine schedule. |
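
The cosine learning rate schedules quoted in the Experiment Setup row can be illustrated with a short sketch. This is a minimal plain-Python example, not the authors' code: the function name `cosine_lr` and its parameters are hypothetical, and it assumes the rate anneals to a minimum of 0 over the full run with no warm-up, which the excerpts above do not state.

```python
import math


def cosine_lr(step: int, total_steps: int, initial_lr: float, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at `step` (0-indexed) out of `total_steps`.

    Assumption: anneal from `initial_lr` down to `min_lr` with no warm-up.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: the 100-epoch setting with initial learning rate 0.1 from the row above.
schedule = [cosine_lr(epoch, total_steps=100, initial_lr=0.1) for epoch in range(100)]
```

Under these assumptions the schedule starts at the initial rate, reaches half of it at the midpoint of training, and decreases monotonically in between. Library schedulers may differ in details such as per-step versus per-epoch updates.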