Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Auto-Compressing Networks
Authors: Evangelos Dorovatas, Georgios Paraskevopoulos, Alexandros Potamianos
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement ACNs in fully connected and transformer-based architectures, finding that they match or outperform residual baselines, while 30-80% of top layers effectively become redundant as information concentrates in the lower layers. Notably, ACNs are hardware-friendly and require no specialized software. [...] We implement auto-compressing networks on top of state-of-the-art neural architectures across diverse tasks and datasets. We implement ACNs using variants of the Transformer [52] for language and vision tasks and MLP-Mixer [50] for vision tasks. This allows us to evaluate our approach on diverse benchmarks including image classification (CIFAR-10 [27], ImageNet-1K [41]), sentiment analysis, and language understanding (BERT [9] on GLUE [54]). |
| Researcher Affiliation | Academia | Vaggelis Dorovatas (1, 2), Georgios Paraskevopoulos (3), Alexandros Potamianos (1, 2). (1) National Technical University of Athens; (2) Archimedes RU, Athena RC; (3) Institute of Language and Speech Processing, Athena RC |
| Pseudocode | No | The paper describes the model architecture and gradient dynamics using mathematical equations and textual explanations, for example, in Section 2.1 "Gradient Propagation Across Network Architectures" and Appendix C "Gradient Propagation equations derivation". However, it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for the paper is available here. [...] We provide an anonymous link where code for the paper will be released. |
| Open Datasets | Yes | We implement ACNs using variants of the Transformer [52] for language and vision tasks and MLP-Mixer [50] for vision tasks. This allows us to evaluate our approach on diverse benchmarks including image classification (CIFAR-10 [27], ImageNet-1K [41]), sentiment analysis, and language understanding (BERT [9] on GLUE [54]). [...] Using the original BERT pretraining corpus (BooksCorpus [64] and English Wikipedia). |
| Dataset Splits | Yes | create a random subset of CIFAR-10 [27] by retaining only 100 samples per class, resulting in a total of 1000 examples. [...] The experiments are performed with the AC-ViT and residual ViT architectures trained on ImageNet-1K. [...] We compare the ACN and residual architectures in the standard BERT pre-training and fine-tuning paradigm. [...] we evaluate both architectures on the split CIFAR-100 continual learning benchmark, comprising 20 sequential disjoint 5-class classification tasks. |
| Hardware Specification | Yes | We acknowledge EuroHPC JU project ID EHPC-AI-2024-A04-051 for use of the supercomputer LEONARDO@CINECA, Italy. |
| Software Dependencies | No | Appendix B mentions "AdamW optimizer [34]" but does not specify software dependencies like programming language versions, library versions (e.g., PyTorch, TensorFlow), or other specific software packages with their versions required for replication. |
| Experiment Setup | Yes | CIFAR-10 MLP-Mixer: The MLP-Mixers have 16 layers with a hidden size of 128. The patch size is 4 (the input is 32x32, 3 channels). The MLP dimension D_C is 512, while D_S is 64. We are using the AdamW optimizer [34] with a maximum learning rate of 0.001 and a Cosine Scheduler with Warmup. The batch size is 64. [...] For both models we use 256 batch size due to memory constraints. AC-ViT converges at 700 epochs, while the Residual ViT converges at 300 epochs. [...] For this purpose, we create a random subset of CIFAR-10 [27] by retaining only 100 samples per class, resulting in a total of 1000 examples. Using the same training settings and models as described in Section 4.1 (MLP-Mixer on CIFAR10), we train both architectures for 150 epochs [...] Models are trained for 10 epochs per task [...] For Synaptic Intelligence we use a coefficient λ = 1. |
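
The low-data split quoted above (100 samples per class, 1000 examples in total) can be illustrated with a minimal sketch. The quoted text does not say how the subset was drawn, so the function name `subsample_per_class`, the seed, and the stand-in label list are all assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import defaultdict


def subsample_per_class(labels, per_class=100, seed=0):
    """Return sorted indices that keep `per_class` random samples of each label.

    Hypothetical helper: the sampling procedure and seed are assumptions.
    """
    rng = random.Random(seed)

    # Group dataset indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    kept = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < per_class:
            raise ValueError(f"class {label} has only {len(indices)} samples")
        # Draw without replacement so no example is kept twice.
        kept.extend(rng.sample(indices, per_class))
    return sorted(kept)


# Stand-in labels shaped like the CIFAR-10 training set:
# 10 classes with 5000 examples each (the ordering here is made up).
labels = [i % 10 for i in range(50_000)]
subset = subsample_per_class(labels, per_class=100, seed=0)
print(len(subset))  # 1000
```

Fixing the seed makes the subset repeatable across runs, which matters when two architectures are meant to be compared on the same 1000 examples.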