Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Contrastive Estimation Reveals Topic Posterior Information to Linear Models
Authors: Christopher Tosh, Akshay Krishnamurthy, Daniel Hsu
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 6, we validate our theoretical findings on a simulated topic recovery task, demonstrating that contrastive learning in our setting leads to recovery of topic posterior information. In Section 7, we apply our contrastive learning procedure to a semi-supervised document classification task. We show that these embeddings generally outperform several natural baselines, particularly in the scarce labeled data regime. |
| Researcher Affiliation | Collaboration | Christopher Tosh, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY 10065; Akshay Krishnamurthy, Microsoft Research, New York, NY 10012; Daniel Hsu, Department of Computer Science and Data Science Institute, Columbia University, New York, NY 10027 |
| Pseudocode | Yes | Algorithm 1 (Contrastive Estimation with Documents). Input: corpus U of unlabeled documents. Initialize: S = ∅. For i = 1, ..., n: sample x and x' independently from unif(U); with probability 1/2 add the positive triple (x^(1), x^(2), 1) to S, otherwise add the negative triple (x^(1), x'^(2), 0). Solve the optimization problem min_f Σ_{(x^(1), x^(2), y) ∈ S} [ y log(1 + e^{-f(x^(1), x^(2))}) + (1 - y) log(1 + e^{f(x^(1), x^(2))}) ]. Select landmark documents l_1, ..., l_M and embed φ̂(x) = (exp(f̂(x, l_i)) : i ∈ [M]). |
| Open Source Code | No | The paper does not provide explicit statements about releasing source code, nor does it include links to repositories or supplementary materials containing code. |
| Open Datasets | Yes | The IMDB movie review sentiment classification dataset (Maas et al., 2011) is a collection of movie reviews that are classified as either positive or negative in sentiment. ... The AG news topic classification dataset is a collection of short news articles. The dataset as compiled by Zhang et al. (2015) has 4 classes... The DBpedia ontology dataset consists of short descriptions of entities extracted from Wikipedia articles. As compiled by Zhang et al. (2015), the dataset has 14 classes... |
| Dataset Splits | Yes | The AG news topic classification dataset ... has 30k training examples per class, and 1900 test examples per class. We randomly selected 1k examples per class from the training set as labeled training data... The IMDB movie review sentiment classification dataset (Maas et al., 2011) ... comes pre-separated into an unlabeled dataset of 50k examples, a training set of 12.5k examples per class, and a test set of 12.5k examples per class. We held out 1k examples from each class in the training set as labeled training data and 3.5k examples from each class in the test set as labeled testing data... |
| Hardware Specification | No | The paper mentions training neural networks and using PyTorch, but no specific hardware details such as GPU models, CPU types, or memory specifications are provided. |
| Software Dependencies | No | We used ReLU nonlinearities, layer normalization, and the default PyTorch initialization (Paszke et al., 2019). We optimized using Adam with learning rate 10^-3 for 1k epochs and then continued training using SGD with learning rate 10^-4 for 100 epochs... we fit an LDA topic model using batch variational Bayes (Blei et al., 2003) on a dataset with scikit-learn's default choices of α = β = 1/K and used the resulting topics to generate the data. |
| Experiment Setup | Yes | To solve the optimization problem in Eq. (1), we trained three-layer neural networks with fully-connected layers using 512 nodes per hidden layer. We used ReLU nonlinearities, layer normalization, and the default PyTorch initialization (Paszke et al., 2019). We optimized using Adam with learning rate 10^-3 for 1k epochs and then continued training using SGD with learning rate 10^-4 for 100 epochs... For all methods, we used ℓ2-regularized logistic regression to fit a linear classifier on the labeled data, where the regularization parameter was chosen using three-fold cross-validation. |
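The contrastive pipeline in Algorithm 1 can be sketched end to end in a few dozen lines. This is a toy illustration, not the paper's implementation: documents are drawn from a pure-topic LDA-style generator, f is a bilinear form x1ᵀ W x2 trained by full-batch gradient descent rather than the three-layer PyTorch MLP the authors use, and all sizes and constants (V, K, n_pairs, lr, M) are made-up values for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a word-id sequence drawn from one of K topics.
V, K, n_docs, doc_len = 30, 3, 400, 20
topics = rng.dirichlet(np.ones(V) * 0.2, size=K)   # K topic-word distributions
doc_topics = rng.integers(K, size=n_docs)          # one topic per document
docs = np.array([rng.choice(V, size=doc_len, p=topics[t]) for t in doc_topics])

def bow(half):
    """Normalized bag-of-words vector for one half of a document."""
    v = np.zeros(V)
    np.add.at(v, half, 1.0)
    return v / len(half)

halves1 = np.array([bow(d[:doc_len // 2]) for d in docs])  # x^(1)
halves2 = np.array([bow(d[doc_len // 2:]) for d in docs])  # x^(2)

# Build S as in Algorithm 1: w.p. 1/2 a positive pair (two halves of the same
# document, y = 1), else a negative pair (halves of independent docs, y = 0).
n_pairs = 4000
X1, X2, Y = [], [], []
for _ in range(n_pairs):
    i, j = rng.integers(n_docs, size=2)
    if rng.random() < 0.5:
        X1.append(halves1[i]); X2.append(halves2[i]); Y.append(1.0)
    else:
        X1.append(halves1[i]); X2.append(halves2[j]); Y.append(0.0)
X1, X2, Y = map(np.array, (X1, X2, Y))

# Minimize the logistic objective of Eq. (1) over f(x1, x2) = x1^T W x2.
# The gradient of the per-pair loss w.r.t. the logit is (p - y).
W = np.zeros((V, V))
lr = 50.0
for _ in range(300):
    logits = np.einsum('nd,de,ne->n', X1, W, X2)
    p = 1.0 / (1.0 + np.exp(-logits))
    grad = np.einsum('n,nd,ne->de', p - Y, X1, X2) / n_pairs
    W -= lr * grad

# Landmark embedding: pick M landmark halves and map x -> (exp f(x, l_i))_i.
M = 10
landmarks = halves2[rng.choice(n_docs, size=M, replace=False)]

def embed(x):
    return np.exp(landmarks @ (W.T @ x))   # entries exp(x^T W l_i)

emb = np.array([embed(h) for h in halves1])  # (n_docs, M) embeddings
```

A linear classifier (e.g. the ℓ2-regularized logistic regression the authors fit) would then be trained on `emb` using whatever labeled examples are available; the paper's theory says this embedding carries the topic posterior information a linear model needs.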