Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Generalizable Reasoning through Compositional Energy Minimization

Authors: Alexandru Oarga, Yilun Du

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our approach on a wide set of reasoning problems. Our method outperforms existing state-of-the-art methods, demonstrating its ability to generalize to larger and more complex problems. [...] We illustrate the applicability of our approach across a set of difficult reasoning problems, including the N-Queens, 3-SAT and the Graph Coloring. We compare against domain-specific state of the art combinatorial optimization models, and show that our approach outperforms them in terms of solution quality and generalization to larger and more complex problems.
Researcher Affiliation	Academia	Alexandru Oarga University of Barcelona Yilun Du Harvard University
Pseudocode	Yes	Algorithm 1 Parallel Energy Minimization (PEM) Input: T optimization steps, P particles
Open Source Code	Yes	Project website can be found at: https://alexoarga.github.io/compositional_reasoning/ Additionally, the NeurIPS Paper Checklist states: "Data and code for reproduction are released as supplementary material."
Open Datasets	Yes	We evaluate using the SATLIB benchmark [34]. For evaluation, we use graphs from the well-known COLOR benchmark. We evaluate on the Crosswords Mini Benchmark introduced in [71]. To train our approach, we sample 32.7k and 6.8k five-letter words from the Crosswords QA dataset [63] for training and validation, respectively.
Dataset Splits	Yes	During training we use only one single instance of the N-queens problem for a given value N. [...] We generated 4000 random satisfiable 3-SAT instances for training and 1000 for validation, using the cnfgen Python package. [...] We generated 1000 random graphs following the approach from [41]. [...] We then make a 90-10 split for training and validation. [...] To train our approach, we sample 32.7k and 6.8k five-letter words from the Crosswords QA dataset [63] for training and validation, respectively.
Hardware Specification	Yes	With a single Nvidia A10 GPU with 24GB of memory, the model was trained in approximately 5 hours.
Software Dependencies	No	The paper mentions "AdamW optimizer" and "cnfgen Python package" but does not specify version numbers for these or other key software components such as Python itself, or deep learning frameworks like PyTorch or TensorFlow.
Experiment Setup	Yes	As a model, we used a 3-layer MLP, with each layer having: layer normalization and 3 linear layers of dimensions 128, 256, 128, followed by a ReLU activation. We added skip connections for each layer. The model was trained with a learning rate of 1e-4 with AdamW optimizer for 20000 epochs with a batch size of 2048. For the contrastive loss, we used a weight of 0.5. For scheduled noise we used a linear schedule with T = 100 timesteps.
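
The Dataset Splits and Experiment Setup rows quote enough detail to write down a reproduction config. The following is a minimal sketch, not the authors' code: the class and function names are hypothetical, the noise range (`start`, `end`) is an assumed placeholder because the paper excerpt only states that the schedule is linear with T = 100, and applying the 90-10 split to the 1000 generated graphs is an assumption drawn from the order of the quoted sentences.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Values quoted in the Experiment Setup row; field names are illustrative."""
    num_layers: int = 3
    hidden_dims: tuple = (128, 256, 128)
    learning_rate: float = 1e-4
    optimizer: str = "AdamW"
    epochs: int = 20_000
    batch_size: int = 2048
    contrastive_weight: float = 0.5
    noise_timesteps: int = 100


def linear_schedule(timesteps, start=1e-4, end=2e-2):
    """Linearly spaced noise levels. start and end are assumed, not from the paper."""
    if timesteps < 2:
        return [start]
    step = (end - start) / (timesteps - 1)
    return [start + i * step for i in range(timesteps)]


def split_90_10(items, seed=0):
    """Shuffle a copy of items and split it 90-10 into training and validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10
    return shuffled[:cut], shuffled[cut:]


setup = ExperimentSetup()
noise_levels = linear_schedule(setup.noise_timesteps)
# Integers stand in for the 1000 generated graphs (assumed target of the split).
train_graphs, val_graphs = split_90_10(range(1000))
```

Keeping the reported hyperparameters in one frozen dataclass makes it easy to see which values come from the paper and which, like the noise range and the random seed, a reproducer would still have to choose or look up in the released code.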