Active Sampling for Text Classification with Subinstance Level Queries
Authors: Shayok Chakraborty, Ankita Singh
AAAI 2022, pp. 6150-6158 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our extensive empirical studies on six challenging datasets from the text mining domain corroborate the practical usefulness of our framework over competing baselines." From the Experiments and Results section: "Datasets: We used 6 challenging datasets from the text mining domain to study the performance of our framework." |
| Researcher Affiliation | Academia | Shayok Chakraborty, Ankita Singh Department of Computer Science, Florida State University |
| Pseudocode | No | The paper presents mathematical formulations and derivations but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include any explicit statement about releasing the source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | Datasets: We used 6 challenging datasets from the text mining domain to study the performance of our framework: (i) Hotel Reviews (https://www.kaggle.com/datafiniti/hotel-reviews). Each rating varies from 1 to 5; we used 1 and 2 as the negative class, 4 and 5 as the positive class and discarded samples where the rating was 3; (ii) IMDB (Maas et al. 2011); (iii) SRAA (Nigam, Thrun, and Mitchell 1998); (iv) Review Polarity (Pang and Lee 2004); (v) Sentence Polarity (Pang and Lee 2005); and (vi) Wikipedia Movie Plots (https://www.kaggle.com/jrobischon/wikipedia-movie-plots). |
| Dataset Splits | No | Each dataset was divided into 6 parts: (i) oracle training data (to train the oracle model); (ii) oracle testing data (to test the oracle and compute the oracle prediction threshold T); (iii) neutrality training data (to train the SVM neutrality model); (iv) initial training set L; (v) unlabeled set U; and (vi) test set. The paper specifies the initial training and test sets, but does not explicitly mention a validation set or describe how such a split would be made for model training. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a Logistic Regression classifier, a binary SVM classifier, and an off-the-shelf LP solver, but does not specify any version numbers for these software dependencies. |
| Experiment Setup | Yes | We selected K = 5 as the number of subinstance sizes and split each unlabeled sample at 20%, 40%, 60%, 80% and 100% granularities from the start. The cost vector was defined as Q = [1, 2, 3, 4, 5]... A query budget B was imposed in each AL iteration... The process was continued iteratively until a stopping condition was satisfied (taken as 25 iterations, except for the Hotel Reviews dataset where it was taken as 10 iterations...). The query budget B was selected as 50 for each AL iteration. All the results were averaged over 3 runs... The weight parameter λ was taken as 0.5 and the Gaussian kernel was used to compute the diversity in Equation (2). |
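
The Experiment Setup row above lends itself to a short illustration. Below is a minimal sketch of how the K = 5 subinstance granularities, the cost vector Q, and a Gaussian-kernel diversity term could be set up; it assumes whitespace tokenization and scikit-learn's RBF kernel, and the names `subinstances` and `gaussian_diversity` are our own (the authors released no code), not the paper's implementation.

```python
# Hypothetical sketch (not the authors' code): build the K = 5 subinstances
# of an unlabeled document at 20%, 40%, 60%, 80% and 100% granularities from
# the start, with the associated query cost vector Q = [1, 2, 3, 4, 5].
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

GRANULARITIES = [0.2, 0.4, 0.6, 0.8, 1.0]   # K = 5 subinstance sizes
Q = np.array([1, 2, 3, 4, 5])               # cost of querying each size

def subinstances(text, granularities=GRANULARITIES):
    """Return prefixes of `text` at the given token-level granularities."""
    tokens = text.split()
    return [" ".join(tokens[: max(1, int(round(g * len(tokens))))])
            for g in granularities]

def gaussian_diversity(X, gamma=None):
    """Pairwise diversity of feature vectors X via a Gaussian (RBF) kernel,
    as a stand-in for the diversity term in Equation (2) of the paper."""
    K = rbf_kernel(X, gamma=gamma)          # pairwise similarity in [0, 1]
    return 1.0 - K                          # larger value = more diverse pair

if __name__ == "__main__":
    doc = "this hotel was clean quiet and the staff were genuinely helpful"
    for cost, sub in zip(Q, subinstances(doc)):
        print(f"cost {cost}: {sub!r}")
```

A query-selection step would then trade off the classifier's informativeness against this diversity term (using the weight λ = 0.5 quoted above) and pick subinstance queries whose total cost under Q stays within the per-iteration budget B = 50; how exactly that objective is posed and solved with the LP solver is described in the paper, not reproduced here.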