Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Exploiting Dynamic Sparsity in Einsum

Authors: Christoph Staudt, Mark Blacher, Tim Hoffmann, Lea Kasche, Olaf Beyersdorff, Joachim Giesen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We first verify Proposition 1 empirically by simulating the model counting instance GRIDn. Since this is a synthetic result, we also analyze dynamic sparsity on a recent benchmark of 168 einsum expressions [9] from areas such as probabilistic models, weighted model counting, tensorized language models, and quantum computing. Our findings show that hyper-sparse instances occur in practice but often become evident only during execution. Sparse formats are efficient in such cases but introduce overhead on dense instances. To address this, we propose a hybrid approach that performs well across the benchmark. All experiments were run five times and we report median runtimes. Experiments were conducted on a 64-core system (4 Intel Xeon Gold 6130, 2.1 GHz) with 1.5 TB RAM. Code is available at https://github.com/ti2-group/dynamic-sparsity-einsum. We use Py Torch [25] for the dense baseline, as it has been shown to perform best on the einsum benchmark. Existing sparse einsum libraries cannot handle all benchmark cases due to index limitations, especially in expressions with many input tensors and large intermediate tensors, so we use our own prototype implementation. We exclude ten dense instances from the einsum benchmark due to resource constraints, leaving 158 instances. Optimized contraction orders for all instances are provided by the benchmark.
Researcher Affiliation	Academia	Christoph Staudt , Mark Blacher, Tim Hoffmann, Kaspar Kasche, Olaf Beyersdorff, Joachim Giesen Friedrich Schiller University Jena
Pseudocode	No	The paper describes the hybrid algorithm verbally and with figures (Figure 3, 4, 5, 6) showing its performance characteristics, but does not present it in a structured pseudocode or algorithm block.
Open Source Code	Yes	Code is available at https://github.com/ti2-group/dynamic-sparsity-einsum.
Open Datasets	Yes	on a recent benchmark of 168 einsum expressions [9] from areas such as probabilistic models, weighted model counting, tensorized language models, and quantum computing.
Dataset Splits	No	The paper evaluates performance on a benchmark of einsum expressions. It mentions excluding ten dense instances due to resource constraints, leaving 158 instances, but it does not specify explicit training/test/validation splits in the traditional machine learning sense for model training.
Hardware Specification	Yes	Experiments were conducted on a 64-core system (4 Intel Xeon Gold 6130, 2.1 GHz) with 1.5 TB RAM.
Software Dependencies	No	We use Py Torch [25] for the dense baseline, as it has been shown to perform best on the einsum benchmark. Existing sparse einsum libraries cannot handle all benchmark cases due to index limitations, especially in expressions with many input tensors and large intermediate tensors, so we use our own prototype implementation. The paper mentions software like PyTorch and sparse but does not provide specific version numbers for these components.
Experiment Setup	Yes	Hybrid algorithm. The hybrid algorithm evaluates contractions in the dense format while the average density of the remaining (not-yet-contracted) tensors stays above a chosen threshold. Once the average density falls below that threshold, the algorithm switches to the sparse format. For the experiments reported here we used a 5% threshold chosen empirically from the benchmark. To minimize the overhead of calculating the density, we compute it only before expensive contractions. Moreover, once the average density exceeds 95%, we stop calculating it.