Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Imbalances in Neurosymbolic Learning: Characterization and Mitigating Strategies

Authors: Efthymia Tsamoura, Kaifu Wang, Dan Roth

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our empirical analysis shows that our techniques can improve the accuracy over strong baselines in NSL [71, 67] and long-tailed learning [42, 18] by up to 14% and that the straightforward application of previous state-of-the-art to NESY is impossible [65] or problematic [18]. We consider the state-of-the-art loss, semantic loss (SL) [71, 67, 20], and use the engine Scallop [20] that performs NESY training using that loss. Benchmarks. We carry out experiments using NESY benchmarks previously used in the NSL literature [36, 38, 20, 28], namely MAX-M, SUM-M [36, 20] and HWF-M [28, 30], as well as a newly introduced one, called Smallest Parent. Training samples in MAX-M are as described in Example 1.1. We vary M to {3, 4, 5} and use the MNIST benchmark to obtain training and testing instances. The results of our analysis are summarized in Table 1 and Figure 4. The accuracies in all the tables (obtained over three different runs for low-variance scenarios and ten runs for high-variance scenarios) are balanced, i.e., they are the weighted sums of the class-specific accuracies, where each weight is the ratio of the corresponding class in the test data.
Researcher Affiliation	Collaboration	Efthymia Tsamoura (Huawei Labs); Kaifu Wang (University of Pennsylvania); Dan Roth (University of Pennsylvania)
Pseudocode	Yes	Algorithm 1 LABEL RATIO SOLVER. Input: weak labels {s_k}_{k=1}^{m}, function σ, step size t, iterations N_iter. Initialize: logit u ← 1_c; p_j, for j ∈ [c_S]. for N = 1, ..., N_iter do: r̂ ← softmax(u); for each j ∈ [c_S] do p̂_j ← Σ_{(y_1,...,y_M) ∈ σ^{-1}(a_j)} ∏_{i=1}^{M} r̂_{y_i}; ℓ ← −Σ_{j=1}^{c_S} p_j log p̂_j; backpropagate ℓ to update u. return softmax(u). Algorithm 2 CAROT. Input: model's raw scores P ∈ R^{c×n}, ratio estimates r̂ ∈ R^c, entropic reg. parameter η > 0, margin reg. parameter τ > 0, iterations N_iter. Initialize: u ← 0_n; v ← 0_c. for N = 1, ..., N_iter do: a ← B(u, v) 1_c; b ← B(u, v)^T 1_n; if k is even then update v (see Section 4.3) else update u (see Section 4.3). return B(u, v).
Open Source Code	Yes	The source code to run our empirical analysis is available at https://github.com/tsamoura/imbalances-nsl.
Open Datasets	Yes	We aim to learn an MNIST classifier f, using only samples of the form (x1, x2, s), where x1 and x2 are MNIST digits... To create datasets for MAX-M, Smallest Parent, SUM-M, and HWF-M we adopted the approach followed in previous work [12, 61, 67, 37, 20]. In particular, to create each training sample, we drew instances x1, ..., xM independently from MNIST or CIFAR-10.
Dataset Splits	Yes	Before sample creation, the images in HWF were split into training and testing ones with ratio 70%/30%, as the benchmark does not offer such splits.
Hardware Specification	Yes	The experiments ran on a 64-bit Ubuntu 22.04.3 LTS machine with Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 3.16TB hard disk and an NVIDIA GeForce RTX 2080 Ti GPU with 11264 MiB RAM. We used CUDA version 12.2.
Software Dependencies	Yes	Our source code was implemented in Python 3.9. We used the following Python libraries: scallopy, highspy, or-tools, PySDD, PyTorch and PyTorch vision. Finally, we used part of the code available at [18] to implement RECORDS and part of the code available at [65] to implement the sliding window approximation for marginal estimation. We used CUDA version 12.2.
Experiment Setup	Yes	We consider a range of different learning rates (LR) ({0.1, 0.01, 0.001, 0.0001}) and temperatures ({0.1, 0.5, 1, 2, 5}) when running Algorithm 1. In each run, we randomly generate (1) a true label ratio and (2) 20 initialization points for Algorithm 1. We run the Adam optimizer for 10,000 iterations and compute the total variation (TV) distance between the estimated label ratio and the gold label ratio. For the Smallest Parent scenarios, we computed SL and (5) using the whole pre-image of each weak label. For the MAX-M scenarios, we only consider the top-1 proof [67] both when running Scallop and in (5) as the space of pre-images is very large.
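
The Experiment Setup row describes running Algorithm 1 and scoring it by the total variation (TV) distance between the estimated and gold label ratios. The following is a minimal sketch of that loop, not the authors' implementation. It assumes σ is `max` over M = 2 hidden labels with c = 3 classes, feeds the solver the exact weak-label distribution instead of empirical frequencies, and swaps Adam for plain gradient descent with a hand-derived gradient so that it runs on the standard library alone. All function names and the values in `true_ratio` are hypothetical.

```python
import itertools
import math


def softmax(u):
    """Numerically stable softmax over a list of logits."""
    m = max(u)
    e = [math.exp(x - m) for x in u]
    z = sum(e)
    return [x / z for x in e]


def weak_label_probs(r, sigma, M):
    """Weak-label distribution induced by label ratio r: for each weak label a,
    sum prod_i r[y_i] over all tuples (y_1..y_M) with sigma(y) == a."""
    probs = {}
    for ys in itertools.product(range(len(r)), repeat=M):
        a = sigma(ys)
        probs[a] = probs.get(a, 0.0) + math.prod(r[y] for y in ys)
    return probs


def loss_grad_wrt_r(r, p, p_hat, sigma, M):
    """Gradient of the cross-entropy -sum_a p[a] * log(p_hat[a]) w.r.t. r."""
    g = [0.0] * len(r)
    for ys in itertools.product(range(len(r)), repeat=M):
        a = sigma(ys)
        w = -p.get(a, 0.0) / p_hat[a]
        for i, y in enumerate(ys):
            # Derivative of prod_k r[y_k] w.r.t. the factor at position i.
            g[y] += w * math.prod(r[yk] for k, yk in enumerate(ys) if k != i)
    return g


def label_ratio_solver(p, sigma, c, M, step=0.2, n_iter=5000):
    """Gradient descent on logits u (assumption: stands in for Adam)."""
    u = [1.0] * c
    for _ in range(n_iter):
        r = softmax(u)
        p_hat = weak_label_probs(r, sigma, M)
        g = loss_grad_wrt_r(r, p, p_hat, sigma, M)
        dot = sum(rk * gk for rk, gk in zip(r, g))
        # Chain rule through softmax: dl/du_k = r_k * (g_k - sum_y r_y * g_y).
        u = [uk - step * rk * (gk - dot) for uk, rk, gk in zip(u, r, g)]
    return softmax(u)


def total_variation(a, b):
    """TV distance between two discrete distributions of equal length."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


true_ratio = [0.5, 0.3, 0.2]                 # hypothetical gold label ratio
p = weak_label_probs(true_ratio, max, M=2)   # exact weak-label distribution
estimate = label_ratio_solver(p, max, c=3, M=2)
tv = total_variation(estimate, true_ratio)
```

In this toy setting the weak-label distribution determines the label ratio uniquely, so `tv` should shrink toward zero as the iteration count grows. The learning-rate and temperature grids from the table are not modeled here.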
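
The Open Datasets row says each training sample is built by drawing instances x1, ..., xM independently from the base dataset. A rough sketch of that step for MAX-M follows, assuming the weak label s is the maximum of the hidden digit labels. The function name `make_max_m_samples` and the stand-in `labels` list (used in place of real MNIST labels) are hypothetical.

```python
import random


def make_max_m_samples(labels, M, n_samples, seed=0):
    """Draw M instance indices independently (with replacement) per sample and
    attach the weak label s = max of the hidden labels (assumed MAX-M rule)."""
    rng = random.Random(seed)
    indices = range(len(labels))
    samples = []
    for _ in range(n_samples):
        idx = tuple(rng.choice(indices) for _ in range(M))
        samples.append((idx, max(labels[i] for i in idx)))
    return samples


# Stand-in for MNIST digit labels: 100 instances cycling through 0..9.
labels = [i % 10 for i in range(100)]
samples = make_max_m_samples(labels, M=3, n_samples=50)
```

Only the instance indices and the weak label are kept in each sample, which mirrors the setting where the hidden per-instance labels are unavailable during training.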