Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Synthesizing Argumentation Frameworks from Examples
Authors: Andreas Niskanen, Johannes P. Wallner, Matti Järvisalo
JAIR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Going beyond defining the AF synthesis problem, we study both theoretical and practical aspects of the problem. In particular, we ... (iv) empirically evaluate our algorithms on different forms of AF synthesis instances |
| Researcher Affiliation | Academia | Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Finland; Institute of Logic and Computation, TU Wien, Austria |
| Pseudocode | Yes | Algorithm 1 CEGAR-based AF synthesis on input (A, E+, E−, σ = prf). |
| Open Source Code | Yes | The implementations and benchmarks used in the evaluation are available online at http://www.cs.helsinki.fi/group/coreo/afsynth/. |
| Open Datasets | Yes | The first set of benchmarks was generated based on the benchmark AFs used in the ICCMA 15 competition (Thimm et al., 2016) |
| Dataset Splits | Yes | For each AF, we picked uniformly at random 5 positive examples from the set of extensions. To obtain negative examples, we selected 10, 20, ..., 150 subsets of ⋃_{S∈E+} S uniformly at random... The second set of benchmarks was generated using the following random model. We picked 5, 10, ..., 80 positive examples from a set of 100 arguments uniformly at random with probability p⁺_arg = 0.25. Then \|E−\| = 20, 40, ..., 200 negative examples were sampled from the set A = ⋃_{S∈E+} S, and each argument was included with probability p⁻_arg = (∑_{e∈E+} \|S_e\| / \|E+\|) / \|⋃_{S∈E+} S\|. Again, each example was assigned as weight a random integer from the interval [1, 10]. For each choice of parameters, this procedure was repeated 10 times to obtain a representative set of benchmarks. The instances for preferred semantics were generated following the same random model, using \|A\| = 20, \|E+\| = 5, 10, 15, 20 and \|E−\| = 10, 20, 30, 40, 50 as parameters. |
| Hardware Specification | Yes | The experiments were run on 2.83-GHz Intel Xeon E5440 quad-core machines with 32-GB memory and Debian GNU/Linux 8 using a per-instance timeout of 900 seconds. |
| Software Dependencies | Yes | For the experiments, we used a variety of state-of-the-art MaxSAT solvers...: MaxHS (version 2.9.0)... Maxino (version k16)... MSCG (version 2014)... Open-WBO (version 1.3.1)... and WPM3 (version 2015.co)... As the ... ASP solver, we used Clingo (version 5.2.1)... we used the SAT-IP hybrid MaxSAT solver LMHS (MaxSAT Evaluation 2016 version)... Furthermore, we used MiniSat... (version 2.2.0) as the SAT solver within the CEGAR approach. |
| Experiment Setup | No | The paper describes how benchmark instances were generated and evaluated, but it does not provide specific hyperparameters for any model training or system-level configuration details beyond the hardware and software used. |
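The random benchmark-generation model quoted in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, parameter names, and the use of Python's `random` module are our assumptions; only the sampling probabilities and weight interval come from the quoted text.

```python
import random

def generate_instance(n_args=100, n_pos=5, n_neg=20, p_pos=0.25, seed=0):
    """Sketch of the quoted random model (names are illustrative)."""
    rng = random.Random(seed)
    args = range(n_args)
    # Each positive example includes each argument independently
    # with probability p⁺_arg = 0.25.
    positives = [{a for a in args if rng.random() < p_pos}
                 for _ in range(n_pos)]
    union = set().union(*positives)
    # Negative examples are sampled from the union of the positive
    # examples; the inclusion probability is the mean positive-example
    # size divided by the union size.
    p_neg = (sum(len(s) for s in positives) / n_pos) / len(union)
    negatives = [{a for a in union if rng.random() < p_neg}
                 for _ in range(n_neg)]
    # Each example is assigned a random integer weight from [1, 10].
    weights = [rng.randint(1, 10) for _ in range(n_pos + n_neg)]
    return positives, negatives, weights
```

Repeating this for each parameter choice (the paper reports 10 repetitions per setting) would yield a benchmark set analogous to the one described.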