Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Knowledge-Based Textual Inference via Parse-Tree Transformations

Authors: Roy Bar-Haim, Ido Dagan, Jonathan Berant

JAIR 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The utility of our approach was illustrated on two tasks: unsupervised relation extraction from a large corpus, and the Recognizing Textual Entailment (RTE) benchmarks. We proved the correctness of the new algorithm and established its efficiency analytically and empirically. We evaluate both the quality of the system's output (in terms of accuracy, precision, and recall) and its computational efficiency (in terms of running time and space), using various application settings. The results are reported in Table 4. The results are summarized in Table 5. Table 6 provides statistics on rule applications... The accuracies obtained in this experiment are shown in Table 7... Table 8 provides a more detailed view of our system's performance. Tables 9 and 10 illustrate the usage and contribution of individual rule bases."
Researcher Affiliation | Academia | "Ido Dagan EMAIL Computer Science Department, Bar-Ilan University, Ramat-Gan 52900, Israel; Jonathan Berant EMAIL Computer Science Department, Stanford University"
Pseudocode | Yes | "The formalism is presented in more detail, including further examples and pseudo-code for its algorithms. ... Algorithm 1: Applying a rule to a tree ... Algorithm 2: Applying an inference rule to a compact forest"
Open Source Code | No | The paper does not contain an explicit statement or link providing access to the source code for the described methodology.
Open Datasets | Yes | "The utility of our approach was illustrated on two tasks: unsupervised relation extraction from a large corpus, and the Recognizing Textual Entailment (RTE) benchmarks. ... To compare explicit and compact inference we randomly sampled 100 pairs from the RTE-3 development set... Table 6 provides statistics on rule applications using all rule bases, over the RTE-3 development set and the RTE-4 dataset. ... The output of TEASE and DIRT, as well as many other knowledge resources, is available from the RTE knowledge resources page: http://aclweb.org/aclwiki/index.php?title=RTE_Knowledge_Resources"
Dataset Splits | Yes | "To compare explicit and compact inference we randomly sampled 100 pairs from the RTE-3 development set... The system was trained on the RTE-3 development set, and was tested on the RTE-3 and RTE-4 test sets (no development set was released for RTE-4). For this evaluation we randomly sampled 75 pairs from the RTE-3 test set..."
Hardware Specification | No | The paper discusses running time and efficiency but does not specify the hardware used for the experiments (e.g., CPU or GPU models).
Software Dependencies | No | The paper mentions using Minipar for parsing and an SVM classifier, but does not specify version numbers for these components or for any other libraries or frameworks.
Experiment Setup | Yes | "In the current system we implemented a simple search strategy, in the spirit of (de Salvo Braz et al., 2005): first, we applied three exhaustive iterations of generic rules. ... At each iteration we first find all rule matches, and then apply all matched rules. ... We then perform a single iteration of all other lexical and lexical-syntactic rules, applying them only if their L part was matched in F and their R part was matched in h. ... Following inference, a set of features is extracted from the resulting F and from h and fed into an SVM classifier, which determines entailment. Co-reference substitution was disabled due to the insufficient accuracy of the co-reference resolution tool we used."
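The experiment-setup excerpt quoted above describes a two-phase search strategy: several exhaustive iterations of generic rules (find all matches first, then apply them all), followed by a single pass of lexical and lexical-syntactic rules applied only when their L part matches the forest F and their R part matches the hypothesis h. The sketch below illustrates that control flow only; the string-based "forest", the rule dictionaries, and the helper names are simplified stand-ins invented for illustration, not the paper's actual parse-forest machinery.

```python
# Minimal sketch of the two-phase rule-application strategy described
# in the Experiment Setup excerpt. Trees/forests are modeled as plain
# strings and rule matching as substring search (a deliberate
# simplification of the paper's parse-tree transformations).

def find_matches(rules, forest):
    """Collect all rules whose left-hand side (L) matches the forest.

    Matching is done for *all* rules before any rule is applied,
    mirroring 'at each iteration we first find all rule matches,
    and then apply all matched rules'.
    """
    return [rule for rule in rules if rule["L"] in forest]

def apply_inference(forest, hypothesis, generic_rules, lexical_rules,
                    exhaustive_iterations=3):
    # Phase 1: exhaustive iterations of generic rules.
    for _ in range(exhaustive_iterations):
        matched = find_matches(generic_rules, forest)
        for rule in matched:
            forest = forest.replace(rule["L"], rule["R"])

    # Phase 2: a single iteration of the remaining lexical and
    # lexical-syntactic rules, applied only if L matches the forest F
    # AND R matches the hypothesis h.
    for rule in lexical_rules:
        if rule["L"] in forest and rule["R"] in hypothesis:
            forest = forest.replace(rule["L"], rule["R"])

    return forest

# Toy usage: a generic rule fires in phase 1; a lexical rule fires in
# phase 2 because its R part also appears in the hypothesis.
generic = [{"L": "purchased", "R": "bought"}]
lexical = [{"L": "bought", "R": "acquired"}]
result = apply_inference("John purchased a car",
                         "John acquired a car",
                         generic, lexical)
print(result)  # -> John acquired a car
```

In the full system, the transformed forest F and the hypothesis h would then feed a feature extractor and an SVM classifier to decide entailment; that stage is omitted here.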