Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Theoretical Analysis of Weak-to-Strong Generalization
Authors: Hunter Lang, David Sontag, Aravindan Vijayaraghavan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6 Experiments. Setup. We explore training linear classifiers on top of the contrastively-fine-tuned SentenceBERT embeddings [52]. As shown in Muennighoff et al. [44], training simple classifiers on top of these complex pretrained representations leads to very competitive performance. We study binary sentiment prediction for movie reviews on the IMDb dataset [41], continuing with the example from Section 3. |
| Researcher Affiliation | Academia | Hunter Lang MIT CSAIL David Sontag MIT CSAIL Aravindan Vijayaraghavan Northwestern University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for reproducing our experimental results is included in the supplemental material. |
| Open Datasets | Yes | We train on the IMDb dataset of movie reviews [41] (Hugging Face Hub ID stanfordnlp/imdb) |
| Dataset Splits | Yes | We train on the IMDb dataset of movie reviews [41] (Hugging Face Hub ID stanfordnlp/imdb), which has 25000 training examples and 25000 test examples, each with exactly 50/50 positive/negative split... and we retrain a model 5 times on 5 different random subsets of the covered training samples, each 80% of the original, and use the other 20% of covered samples as a validation set to perform early stopping with the weak label. |
| Hardware Specification | Yes | We used an internal machine with 4x A100 80GB GPUs to extract all deep network embeddings and to train the linear classifiers on top of those embeddings. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and SentenceBERT embeddings but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or TensorFlow libraries. |
| Experiment Setup | Yes | We train the linear classifiers using the AdamW optimizer [40] with global learning rate 0.01 and a weight decay of 0.1, and linear learning rate decay over 500 optimizer steps. |
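
The Dataset Splits and Experiment Setup rows describe two mechanical pieces of the training protocol: five random 80/20 train/validation splits of the covered training samples, and a learning rate of 0.01 that decays linearly over 500 optimizer steps. The following is a minimal standard-library sketch of those two pieces, not the authors' code. The function names, the seed, the example count `n_covered`, and the choice to decay all the way to zero are assumptions made for illustration; the quoted text does not state the final learning rate or how the subsets were sampled.

```python
import random


def linear_decay_lr(step, base_lr=0.01, total_steps=500):
    """Learning rate at `step` under linear decay.

    base_lr and total_steps come from the quoted setup; decaying
    to exactly zero is an assumption of this sketch.
    """
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


def repeated_splits(n_covered, n_repeats=5, train_frac=0.8, seed=0):
    """Return n_repeats random (train, validation) index splits.

    n_covered is a hypothetical count of covered training samples;
    the seed is arbitrary and only makes the sketch repeatable.
    """
    rng = random.Random(seed)
    cut = int(round(train_frac * n_covered))
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_covered))
        rng.shuffle(indices)
        # First 80% for training, remaining 20% for early stopping.
        splits.append((indices[:cut], indices[cut:]))
    return splits


if __name__ == "__main__":
    for step in (0, 250, 500):
        print(step, linear_decay_lr(step))
    for train_idx, val_idx in repeated_splits(n_covered=100):
        print(len(train_idx), len(val_idx))
```

In an actual reproduction, the schedule would typically be handed to the optimizer through the training framework's scheduler, and the validation indices would be used for early stopping against the weak labels as the Dataset Splits row describes.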