Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

CoT Information: Improved Sample Complexity under Chain-of-Thought Supervision

Authors: Awni Altabaa, Omar Montasser, John D. Lafferty

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This paper develops a statistical theory of learning under CoT supervision. Central to the theory is the CoT information, which measures the additional discriminative power offered by the chain-of-thought for distinguishing hypotheses with different end-to-end behaviors. The main theoretical results demonstrate how CoT supervision can yield significantly faster learning rates compared to standard end-to-end supervision, with both upper bounds and information-theoretic lower bounds characterized by the CoT information. This section presents numerical simulations empirically exploring the CoT information measure for simple CoT hypothesis classes and its ability to predict sample complexity gains from CoT-supervised learning. Figure 4: Numerical experiments for deterministic finite automata CoT hypothesis class. Figure 5: Numerical experiments for iterated linear thresholds CoT hypothesis class.
Researcher Affiliation	Academia	Awni Altabaa Statistics & Data Science Yale University EMAIL Omar Montasser Statistics & Data Science Yale University EMAIL John Lafferty Statistics & Data Science Yale University EMAIL
Pseudocode	No	The paper describes learning rules like "chain-of-thought consistency, CoT-Cons(S; H)" and "CoT empirical risk minimization, CoT-ERM(S; H)" in paragraph text, outlining their functionality. However, it does not present these or any other procedures in structured pseudocode blocks or clearly labeled algorithm environments.
Open Source Code	No	The paper is mainly theoretical, but does include a few empirical simulation results. The code will be made publicly available.
Open Datasets	No	The paper conducts numerical simulations using synthetically generated data or constructed hypothesis classes rather than established, publicly available datasets. For instance, in the DFA experiments, it states: "In these simulations, we fix the size of the state space S and vocabulary Σ, as well as choose an initial state and acceptance state, and generate H as the set of all automata operating on those spaces. We place a uniform distribution over the input space D = Unif(X) = Unif(Σ^n)." Similar generation is implied for the iterated linear thresholds.
Dataset Splits	No	The paper does not utilize external datasets with predefined splits. The data for simulations are generated internally, and thus traditional training/test/validation splits are not applicable or mentioned. The NeurIPS checklist explicitly states under 'Experimental setting/details': "Justification: There are no data splits, hyperparameters, etc."
Hardware Specification	No	The paper does not provide specific details about the hardware used for running its simulations. The NeurIPS checklist for 'Experiments compute resources' has an 'NA' answer with the justification: "Justification: The simulations are simple and require modest computational resources." This statement does not specify any particular CPU, GPU, or other hardware component.
Software Dependencies	No	The paper does not explicitly list any specific software components (e.g., programming languages, libraries, frameworks) with their version numbers that were used to conduct the simulations or theoretical work. There is no mention of tools like Python, PyTorch, TensorFlow, etc., or their specific versions.
Experiment Setup	Yes	In these simulations, we fix the size of the state space S and vocabulary Σ, as well as choose an initial state and acceptance state, and generate H as the set of all automata operating on those spaces. We place a uniform distribution over the input space D = Unif(X) = Unif(Σ^n). For n = 10, we see that this value is roughly 600. We take the window size to be d = 8 and the number of iterations to be T = 16. We repeat this for 500 independent trials to estimate the distribution of R^{e2e}_D(A(S_m)) as a function of the sample size m for each learning rule.
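
The DFA setup quoted in the Experiment Setup row can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' code (which was not available at the time of classification): the function names, the tiny problem sizes (2 states, 2 symbols, n = 4), the choice of state 0 as both initial and acceptance state, the use of the visited-state trace as the chain of thought, and the uniformly random pick among consistent hypotheses are all hypothetical choices made for illustration.

```python
import itertools
import random


def enumerate_dfas(num_states, vocab_size):
    """Yield every transition table delta[state][symbol] -> next state."""
    cells = num_states * vocab_size
    for flat in itertools.product(range(num_states), repeat=cells):
        yield tuple(
            flat[s * vocab_size:(s + 1) * vocab_size] for s in range(num_states)
        )


def run_dfa(delta, x, init_state=0, accept_state=0):
    """Return (trace of visited states, accept bit) for input sequence x.

    Assumption: the trace of visited states plays the role of the CoT.
    """
    state, trace = init_state, []
    for symbol in x:
        state = delta[state][symbol]
        trace.append(state)
    return tuple(trace), int(state == accept_state)


def consistent_set(hypotheses, target, sample, use_cot):
    """Hypotheses that agree with the target on every sampled input.

    With use_cot=False only the final output must match (end-to-end);
    with use_cot=True the state trace must match as well.
    """
    keep = []
    for h in hypotheses:
        for x in sample:
            h_trace, h_out = run_dfa(h, x)
            t_trace, t_out = run_dfa(target, x)
            if h_out != t_out or (use_cot and h_trace != t_trace):
                break
        else:
            keep.append(h)
    return keep


def e2e_error(h, target, inputs):
    """Exact end-to-end error under the uniform distribution over inputs."""
    wrong = sum(run_dfa(h, x)[1] != run_dfa(target, x)[1] for x in inputs)
    return wrong / len(inputs)


def simulate(m, trials=500, num_states=2, vocab_size=2, n=4, seed=0):
    """Mean end-to-end error of each consistency rule at sample size m."""
    rng = random.Random(seed)
    hypotheses = list(enumerate_dfas(num_states, vocab_size))
    inputs = list(itertools.product(range(vocab_size), repeat=n))
    totals = {"e2e": 0.0, "cot": 0.0}
    for _ in range(trials):
        target = rng.choice(hypotheses)
        sample = [rng.choice(inputs) for _ in range(m)]
        for name, use_cot in (("e2e", False), ("cot", True)):
            # Hypothetical tie-breaking: pick a consistent hypothesis at random.
            h = rng.choice(consistent_set(hypotheses, target, sample, use_cot))
            totals[name] += e2e_error(h, target, inputs)
    return {name: total / trials for name, total in totals.items()}


if __name__ == "__main__":
    for m in (1, 2, 4, 8):
        print(m, simulate(m, trials=100))
```

Enumerating the whole class is only feasible at toy sizes, since the number of automata grows as |S|^(|S|·|Σ|); the sketch is meant to show the shape of the experiment, not to reproduce the reported figures.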