Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Contrastive Estimation Reveals Topic Posterior Information to Linear Models
Authors: Christopher Tosh, Akshay Krishnamurthy, Daniel Hsu
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 6, we validate our theoretical findings on a simulated topic recovery task, demonstrating that contrastive learning in our setting leads to recovery of topic posterior information. In Section 7, we apply our contrastive learning procedure to a semi-supervised document classification task. We show that these embeddings generally outperform several natural baselines, particularly in the scarce labeled data regime. |
| Researcher Affiliation | Collaboration | Christopher Tosh, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY 10065; Akshay Krishnamurthy, Microsoft Research, New York, NY 10012; Daniel Hsu, Department of Computer Science and Data Science Institute, Columbia University, New York, NY 10027 |
| Pseudocode | Yes | Algorithm 1 (Contrastive Estimation with Documents). Input: corpus U of unlabeled documents. Initialize: S = ∅. For i = 1, ..., n: sample x and x' independently from unif(U); with probability 1/2 add the positive triple (x^(1), x^(2), 1) to S, otherwise add the negative triple (x^(1), x'^(2), 0). Solve the optimization problem min_f Σ_{(x^(1), x^(2), y) ∈ S} [ y log(1 + e^{-f(x^(1), x^(2))}) + (1 - y) log(1 + e^{f(x^(1), x^(2))}) ]. Select landmark documents l_1, ..., l_M and embed φ̂(x) = (exp(f̂(x, l_i)) : i ∈ [M]). |
| Open Source Code | No | The paper does not provide explicit statements about releasing source code, nor does it include links to repositories or supplementary materials containing code. |
| Open Datasets | Yes | The IMDB movie review sentiment classification dataset (Maas et al., 2011) is a collection of movie reviews that are classified as either positive or negative in sentiment. ... The AG news topic classification dataset is a collection of short news articles. The dataset as compiled by Zhang et al. (2015) has 4 classes... The DBpedia ontology dataset consists of short descriptions of entities extracted from Wikipedia articles. As compiled by Zhang et al. (2015), the dataset has 14 classes... |
| Dataset Splits | Yes | The AG news topic classification dataset ... has 30k training examples per class, and 1900 test examples per class. We randomly selected 1k examples per class from the training set as labeled training data... The IMDB movie review sentiment classification dataset (Maas et al., 2011) ... comes pre-separated into an unlabeled dataset of 50k examples, a training set of 12.5k examples per class, and a test set of 12.5k examples per class. We held out 1k examples from each class in the training set as labeled training data and 3.5k examples from each class in the test set as labeled testing data... |
| Hardware Specification | No | The paper mentions training neural networks and using PyTorch, but no specific hardware details such as GPU models, CPU types, or memory specifications are provided. |
| Software Dependencies | No | We used ReLU nonlinearities, layer normalization, and the default PyTorch initialization (Paszke et al., 2019). We optimized using Adam with learning rate 10^-3 for 1k epochs and then continued training using SGD with learning rate 10^-4 for 100 epochs... we fit an LDA topic model using batch variational Bayes (Blei et al., 2003) on a dataset with scikit-learn's default choices of α = β = 1/K and used the resulting topics to generate the data. |
| Experiment Setup | Yes | To solve the optimization problem in Eq. (1), we trained three-layer neural networks with fully-connected layers using 512 nodes per hidden layer. We used ReLU nonlinearities, layer normalization, and the default PyTorch initialization (Paszke et al., 2019). We optimized using Adam with learning rate 10^-3 for 1k epochs and then continued training using SGD with learning rate 10^-4 for 100 epochs... For all methods, we used ℓ2-regularized logistic regression to fit a linear classifier on the labeled data, where the regularization parameter was chosen using three-fold cross-validation. |
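The contrastive pipeline in Algorithm 1 can be sketched end to end in a few dozen lines. This is a toy illustration, not the paper's implementation: documents are drawn from a pure-topic LDA-style generator, f is a bilinear form x1ᵀ W x2 trained by full-batch gradient descent rather than the three-layer PyTorch MLP the authors use, and all sizes and constants (V, K, n_pairs, lr, M) are made-up values for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a word-id sequence drawn from one of K topics.
V, K, n_docs, doc_len = 30, 3, 400, 20
topics = rng.dirichlet(np.ones(V) * 0.2, size=K)   # K topic-word distributions
doc_topics = rng.integers(K, size=n_docs)          # one topic per document
docs = np.array([rng.choice(V, size=doc_len, p=topics[t]) for t in doc_topics])

def bow(half):
    """Normalized bag-of-words vector for one half of a document."""
    v = np.zeros(V)
    np.add.at(v, half, 1.0)
    return v / len(half)

halves1 = np.array([bow(d[:doc_len // 2]) for d in docs])  # x^(1)
halves2 = np.array([bow(d[doc_len // 2:]) for d in docs])  # x^(2)

# Build S as in Algorithm 1: w.p. 1/2 a positive pair (two halves of the same
# document, y = 1), else a negative pair (halves of independent docs, y = 0).
n_pairs = 4000
X1, X2, Y = [], [], []
for _ in range(n_pairs):
    i, j = rng.integers(n_docs, size=2)
    if rng.random() < 0.5:
        X1.append(halves1[i]); X2.append(halves2[i]); Y.append(1.0)
    else:
        X1.append(halves1[i]); X2.append(halves2[j]); Y.append(0.0)
X1, X2, Y = map(np.array, (X1, X2, Y))

# Minimize the logistic objective of Eq. (1) over f(x1, x2) = x1^T W x2.
# The gradient of the per-pair loss w.r.t. the logit is (p - y).
W = np.zeros((V, V))
lr = 50.0
for _ in range(300):
    logits = np.einsum('nd,de,ne->n', X1, W, X2)
    p = 1.0 / (1.0 + np.exp(-logits))
    grad = np.einsum('n,nd,ne->de', p - Y, X1, X2) / n_pairs
    W -= lr * grad

# Landmark embedding: pick M landmark halves and map x -> (exp f(x, l_i))_i.
M = 10
landmarks = halves2[rng.choice(n_docs, size=M, replace=False)]

def embed(x):
    return np.exp(landmarks @ (W.T @ x))   # entries exp(x^T W l_i)

emb = np.array([embed(h) for h in halves1])  # (n_docs, M) embeddings
```

A linear classifier (e.g. the ℓ2-regularized logistic regression the authors fit) would then be trained on `emb` using whatever labeled examples are available; the paper's theory says this embedding carries the topic posterior information a linear model needs.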