Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Incremental Sequence Classification with Temporal Consistency

Authors: Lucas Maystre, Gabriel Barello, Tudor Berariu, Aleix Cambray, Rares Dolga, Alvaro Ortega Gonzalez, Andrei Nica, David Barber

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we apply our methodology to text classification with decoder-only transformers. First, in Section 4.1, we evaluate multiple different approaches to training incremental classifiers. We compare the predictive performance of models on four well-known text classification benchmarks. Then, in Section 4.2, we consider a concrete application to verifying LLM generations.
Researcher Affiliation	Collaboration	Lucas Maystre, UiPath, London, UK; Gabriel Barello, UiPath, Bellevue, WA, USA; Tudor Berariu, UiPath, London, UK; Aleix Cambray, UiPath, London, UK; Rares Dolga, UiPath & UCL, London, UK; Alvaro Ortega Gonzalez, UiPath, London, UK; Andrei Nica, UiPath, London, UK; David Barber, UiPath & UCL, London, UK
Pseudocode	Yes	Algorithm 1 presents one step of the training loop for the TC-λ and DCE approaches. Note that DCE is obtained simply by setting λ = 1, as explained in Section 2.1.
Open Source Code	No	We intend to complement this information with a comprehensive code release upon publication. We intend to release code with instructions to support the reproduction of our main experiments upon publication.
Open Datasets	Yes	We consider four text classification datasets, spanning tasks such as movie review sentiment prediction (IMDB [25]) and topic classification (OHSUMED [30], NEWSGROUPS [19], AG-NEWS [10]). ... The NEWSGROUPS, IMDB and AG-NEWS datasets are provided with separate train and test splits, which we reuse as-is. For OHSUMED, we create our own train and test splits, by partitioning the data uniformly at random. ... We study GSM8K, a dataset of grade-school math problems and their solutions [8]. For our experiments, we use Qwen2.5-0.5B, a pre-trained language model with 0.5B parameters that is known to perform well on GSM8K for its size [46].
Dataset Splits	Yes	The NEWSGROUPS, IMDB and AG-NEWS datasets are provided with separate train and test splits, which we reuse as-is. For OHSUMED, we create our own train and test splits, by partitioning the data uniformly at random.
Hardware Specification	Yes	We run our experiments on an a3-highgpu-8g instance on Google Cloud, with 208 vCPUs, 1872 GB of memory, and 8 NVIDIA H100 GPUs.
Software Dependencies	No	The paper mentions software components like "AdamW optimizer [24]", "OPT family [49]", "Qwen2.5-0.5B [46]", and "GPT-2 tokenizer [33]", but it does not specify the version numbers for the underlying software frameworks (e.g., PyTorch, TensorFlow, Python) or the specific versions of the optimizers/toolkits used.
Experiment Setup	Yes	We run experiments on a grid of hyperparameter configurations, and we select the configuration that maximizes the full-sequence predictive accuracy on a small dataset of held-out sequences. Finally, throughout section 4, we report the mean and standard deviation of the performance of the winning hyperparameter configuration on the full test set, across 10 training runs with different random seeds. Table 3: Hyperparameters used to fine-tune the OPT-125M models reported in the paper.
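
The evaluation protocol quoted in the Dataset Splits and Experiment Setup rows (a uniform random train/test partition, configuration selection on held-out accuracy, and mean and standard deviation across seeded runs) could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the test fraction, and the use of the sample standard deviation are all assumptions, since the quoted text does not specify them.

```python
import random
import statistics


def random_split(items, test_fraction, seed):
    """Partition items uniformly at random into (train, test) lists.

    test_fraction and seed are hypothetical parameters; the paper's
    actual split ratio for OHSUMED is not given in the quoted text.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def select_config(configs, heldout_accuracy):
    """Return the configuration with the highest held-out accuracy.

    heldout_accuracy is a caller-supplied function mapping a
    configuration to its full-sequence accuracy on held-out sequences.
    """
    return max(configs, key=heldout_accuracy)


def summarize_runs(test_accuracies):
    """Mean and sample standard deviation over independent training runs.

    Assumption: sample (n - 1) standard deviation; the quoted text only
    says "standard deviation".
    """
    return statistics.mean(test_accuracies), statistics.stdev(test_accuracies)
```

In this sketch, `select_config` would be called once over the hyperparameter grid, and `summarize_runs` would then be applied to the test accuracies of the winning configuration trained with different random seeds.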