Active Sampling for Text Classification with Subinstance Level Queries

Authors: Shayok Chakraborty, Ankita Singh

AAAI 2022, pp. 6150-6158

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive empirical studies on six challenging datasets from the text mining domain corroborate the practical usefulness of our framework over competing baselines. Experiments and Results: Datasets: We used 6 challenging datasets from the text mining domain to study the performance of our framework.
Researcher Affiliation | Academia | Shayok Chakraborty, Ankita Singh, Department of Computer Science, Florida State University
Pseudocode | No | The paper presents mathematical formulations and derivations but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include any explicit statement about releasing the source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | Datasets: We used 6 challenging datasets from the text mining domain to study the performance of our framework: (i) Hotel Reviews (https://www.kaggle.com/datafiniti/hotel-reviews): each rating varies from 1 to 5; we used 1 and 2 as the negative class, 4 and 5 as the positive class, and discarded samples where the rating was 3; (ii) IMDB (Maas et al. 2011); (iii) SRAA (Nigam, Thrun, and Mitchell 1998); (iv) Review Polarity (Pang and Lee 2004); (v) Sentence Polarity (Pang and Lee 2005); and (vi) Wikipedia Movie Plots (https://www.kaggle.com/jrobischon/wikipedia-movie-plots).
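As a concrete illustration of the rating-to-label mapping quoted above, a minimal preprocessing sketch for the Hotel Reviews data might look as follows; the file name and column names are hypothetical, since the paper does not describe its loading code:

```python
import pandas as pd

# Hypothetical file and column names; the Kaggle CSV schema may differ.
df = pd.read_csv("hotel-reviews.csv")

# Ratings 1-2 form the negative class, 4-5 the positive class,
# and samples with rating 3 are discarded, as described in the paper.
df = df[df["rating"] != 3].copy()
df["label"] = (df["rating"] >= 4).astype(int)  # 1 = positive, 0 = negative

texts, labels = df["text"].tolist(), df["label"].tolist()
```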
Dataset Splits | No | Each dataset was divided into 6 parts: (i) oracle training data (to train the oracle model); (ii) oracle testing data (to test the oracle and compute the oracle prediction threshold T); (iii) neutrality training data (to train the SVM neutrality model); (iv) initial training set L; (v) unlabeled set U; and (vi) test set. The paper specifies initial training and test sets, but does not explicitly mention a validation set or describe its split for model training.
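The six-way partition could be reproduced with an index-based split along these lines; the proportions below are placeholders, since the paper does not report them:

```python
import numpy as np

def six_way_split(n_samples, fractions, seed=0):
    """Partition sample indices into the six parts described in the paper:
    oracle training, oracle testing, neutrality training, initial labeled
    set L, unlabeled set U, and test set."""
    assert len(fractions) == 6 and abs(sum(fractions) - 1.0) < 1e-9
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    bounds = np.cumsum([int(f * n_samples) for f in fractions[:-1]])
    return np.split(idx, bounds)

# Placeholder fractions; the actual split sizes are not given in the paper.
oracle_tr, oracle_te, neutral_tr, L, U, test = six_way_split(
    n_samples=10_000, fractions=[0.2, 0.1, 0.2, 0.05, 0.35, 0.1])
```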
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using a Logistic Regression classifier, a binary SVM classifier, and an off-the-shelf LP solver, but does not specify version numbers for these software dependencies.
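The components named in the paper could be instantiated as in the sketch below; scikit-learn and SciPy are assumptions here, since the paper names neither libraries nor versions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from scipy.optimize import linprog

classifier = LogisticRegression(max_iter=1000)  # main text classifier
neutrality_model = SVC(kernel="linear")         # binary SVM neutrality model

# Off-the-shelf LP solver: minimize c @ x subject to A_ub @ x <= b_ub.
# c, A_ub and b_ub would come from the paper's query-selection formulation:
# result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1))
```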
Experiment Setup | Yes | We selected K = 5 as the number of subinstance sizes and split each unlabeled sample at 20%, 40%, 60%, 80% and 100% granularities from the start. The cost vector was defined as Q = [1, 2, 3, 4, 5]... A query budget B was imposed in each AL iteration... The process was continued iteratively until a stopping condition was satisfied (taken as 25 iterations, except for the Hotel Reviews dataset where it was taken as 10 iterations...). The query budget B was selected as 50 for each AL iteration. All the results were averaged over 3 runs... The weight parameter λ was taken as 0.5 and the Gaussian kernel was used to compute the diversity in Equation (2).
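A minimal sketch of this setup, with the reported constants and a prefix-based subinstance split; whitespace tokenization and the kernel bandwidth are assumptions not stated in the quoted text:

```python
import numpy as np

K = 5                                        # number of subinstance sizes
GRANULARITIES = [0.2, 0.4, 0.6, 0.8, 1.0]    # split points, measured from the start
Q = np.array([1, 2, 3, 4, 5])                # cost of querying each subinstance size
B = 50                                       # query budget per AL iteration
LAMBDA = 0.5                                 # weight parameter in Equation (2)
N_ITERATIONS = 25                            # 10 for the Hotel Reviews dataset

def make_subinstances(document):
    """Split a document into K prefixes at the given granularities.
    Whitespace tokenization is an assumption; the paper does not specify it."""
    tokens = document.split()
    return [" ".join(tokens[:max(1, int(g * len(tokens)))]) for g in GRANULARITIES]

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel, used to compute the diversity term in Equation (2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
```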