Active Sampling for Text Classification with Subinstance Level Queries

Authors: Shayok Chakraborty, Ankita Singh

AAAI 2022, pp. 6150-6158

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive empirical studies on six challenging datasets from the text mining domain corroborate the practical usefulness of our framework over competing baselines. Experiments and Results: Datasets: We used 6 challenging datasets from the text mining domain to study the performance of our framework.
Researcher Affiliation | Academia | Shayok Chakraborty, Ankita Singh, Department of Computer Science, Florida State University
Pseudocode | No | The paper presents mathematical formulations and derivations but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include any explicit statement about releasing the source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | Datasets: We used 6 challenging datasets from the text mining domain to study the performance of our framework: (i) Hotel Reviews (https://www.kaggle.com/datafiniti/hotel-reviews): each rating varies from 1 to 5; we used 1 and 2 as the negative class, 4 and 5 as the positive class, and discarded samples where the rating was 3; (ii) IMDB (Maas et al. 2011); (iii) SRAA (Nigam, Thrun, and Mitchell 1998); (iv) Review Polarity (Pang and Lee 2004); (v) Sentence Polarity (Pang and Lee 2005); and (vi) Wikipedia Movie Plots (https://www.kaggle.com/jrobischon/wikipedia-movie-plots).
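As a concrete illustration of the rating-to-label mapping quoted above, a minimal preprocessing sketch for the Hotel Reviews data might look as follows; the file name and column names are hypothetical, since the paper does not describe its loading code:

```python
import pandas as pd

# Hypothetical file and column names; the Kaggle CSV schema may differ.
df = pd.read_csv("hotel-reviews.csv")

# Ratings 1-2 form the negative class, 4-5 the positive class,
# and samples with rating 3 are discarded, as described in the paper.
df = df[df["rating"] != 3].copy()
df["label"] = (df["rating"] >= 4).astype(int)  # 1 = positive, 0 = negative

texts, labels = df["text"].tolist(), df["label"].tolist()
```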
Dataset Splits | No | Each dataset was divided into 6 parts: (i) oracle training data (to train the oracle model); (ii) oracle testing data (to test the oracle and compute the oracle prediction threshold T); (iii) neutrality training data (to train the SVM neutrality model); (iv) initial training set L; (v) unlabeled set U; and (vi) test set. The paper specifies initial training and test sets, but does not explicitly mention a validation set or describe its split for model training.
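The six-way partition could be reproduced with an index-based split along these lines; the proportions below are placeholders, since the paper does not report them:

```python
import numpy as np

def six_way_split(n_samples, fractions, seed=0):
    """Partition sample indices into the six parts described in the paper:
    oracle training, oracle testing, neutrality training, initial labeled
    set L, unlabeled set U, and test set."""
    assert len(fractions) == 6 and abs(sum(fractions) - 1.0) < 1e-9
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    bounds = np.cumsum([int(f * n_samples) for f in fractions[:-1]])
    return np.split(idx, bounds)

# Placeholder fractions; the actual split sizes are not given in the paper.
oracle_tr, oracle_te, neutral_tr, L, U, test = six_way_split(
    n_samples=10_000, fractions=[0.2, 0.1, 0.2, 0.05, 0.35, 0.1])
```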
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using a Logistic Regression classifier, a binary SVM classifier, and an off-the-shelf LP solver, but does not specify version numbers for these software dependencies.
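The components named in the paper could be instantiated as in the sketch below; scikit-learn and SciPy are assumptions here, since the paper names neither libraries nor versions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from scipy.optimize import linprog

classifier = LogisticRegression(max_iter=1000)  # main text classifier
neutrality_model = SVC(kernel="linear")         # binary SVM neutrality model

# Off-the-shelf LP solver: minimize c @ x subject to A_ub @ x <= b_ub.
# c, A_ub and b_ub would come from the paper's query-selection formulation:
# result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1))
```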
Experiment Setup | Yes | We selected K = 5 as the number of subinstance sizes and split each unlabeled sample at 20%, 40%, 60%, 80% and 100% granularities from the start. The cost vector was defined as Q = [1, 2, 3, 4, 5]... A query budget B was imposed in each AL iteration... The process was continued iteratively until a stopping condition was satisfied (taken as 25 iterations, except for the Hotel Reviews dataset where it was taken as 10 iterations...). The query budget B was selected as 50 for each AL iteration. All the results were averaged over 3 runs... The weight parameter λ was taken as 0.5 and the Gaussian kernel was used to compute the diversity in Equation (2).
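A minimal sketch of this setup, with the reported constants and a prefix-based subinstance split; whitespace tokenization and the kernel bandwidth are assumptions not stated in the quoted text:

```python
import numpy as np

K = 5                                        # number of subinstance sizes
GRANULARITIES = [0.2, 0.4, 0.6, 0.8, 1.0]    # split points, measured from the start
Q = np.array([1, 2, 3, 4, 5])                # cost of querying each subinstance size
B = 50                                       # query budget per AL iteration
LAMBDA = 0.5                                 # weight parameter in Equation (2)
N_ITERATIONS = 25                            # 10 for the Hotel Reviews dataset

def make_subinstances(document):
    """Split a document into K prefixes at the given granularities.
    Whitespace tokenization is an assumption; the paper does not specify it."""
    tokens = document.split()
    return [" ".join(tokens[:max(1, int(g * len(tokens)))]) for g in GRANULARITIES]

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel, used to compute the diversity term in Equation (2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
```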