Textual Membership Queries

Authors: Jonathan Zarecki, Shaul Markovitch

IJCAI 2020

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "We implement this framework in the textual domain and test it on several text classification tasks and show improved classifier performance as more MQs are labeled and incorporated into the training set." "We analyzed the performance of our framework on 5 publicly available sentence classification datasets." (Section 4, Empirical Evaluation)
Researcher Affiliation | Academia | Jonathan Zarecki and Shaul Markovitch, Department of Computer Science, Technion - Israel Institute of Technology, szarecki@cs.technion.ac.il, shaulm@cs.technion.ac.il
Pseudocode | Yes | Algorithm 1: Stochastic query synthesis; Algorithm 2: Search-based query synthesis
Open Source Code | Yes | "The code for all experiments is available here" (footnote 2: www.github.com/jonzarecki/textual-mqs)
Open Datasets | Yes | "We report results on 5 binary sentence classification datasets: three sentiment analysis datasets, one sentence subjectivity dataset, and one hate-speech detection dataset. CMR: Cornell sentiment polarity dataset [Pang and Lee, 2005]. SST: Stanford sentiment treebank, a sentence sentiment analysis dataset [Socher et al., 2013]. KS: A Kaggle short-sentence sentiment analysis dataset. HS: Hate speech and offensive language classification dataset [Davidson et al., 2017]. SUBJ: Cornell sentence subjective/objective dataset [Pang and Lee, 2004]."
Dataset Splits | No | The paper mentions evaluating against a test set and states cross-validation accuracy for an artificial expert, but does not specify the train/validation/test splits (e.g., percentages or counts) for the datasets used to train their own models.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models, or memory specifications.
Software Dependencies | No | The paper mentions software tools like Dependency Word2vec and Spacy's "latest" part-of-speech parser, but it does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | "We used a core set of 10 sentences, a pool size of 20, an AL batch size of 5, and the uncertainty-sampling-based [Lewis and Gale, 1994] heuristic function as U for all experiments. All methods used an environment size of 10, and a linear classifier with averaged 300-dim GloVe [Pennington et al., 2014] word vectors as the learner."
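The setup above can be sketched as a single uncertainty-sampling step: train a linear classifier on the labeled core set, score the unlabeled pool by closeness to the decision boundary, and select a batch of the least-confident pool items. This is a minimal illustration, not the authors' code; it assumes scikit-learn's `LogisticRegression` as the linear learner, and random vectors stand in for the averaged 300-dim GloVe features. The sizes (core set 10, pool 20, batch 5) follow the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
DIM, CORE_SIZE, POOL_SIZE, BATCH_SIZE = 300, 10, 20, 5

# Stand-ins for averaged 300-dim word vectors (GloVe in the paper).
X_core = rng.normal(size=(CORE_SIZE, DIM))
y_core = np.array([0, 1] * (CORE_SIZE // 2))  # labeled core set
X_pool = rng.normal(size=(POOL_SIZE, DIM))    # unlabeled pool

# Linear classifier trained on the labeled core set.
clf = LogisticRegression(max_iter=1000).fit(X_core, y_core)

# Uncertainty heuristic U: prefer pool items whose predicted
# positive-class probability is closest to 0.5.
probs = clf.predict_proba(X_pool)[:, 1]
uncertainty = -np.abs(probs - 0.5)

# Query the BATCH_SIZE most uncertain items for labeling.
query_idx = np.argsort(uncertainty)[-BATCH_SIZE:]
```

In a full AL loop, the queried items would be labeled by the expert, appended to the core set, and the classifier retrained before the next batch is selected.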