Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Deep Autoregressive Models as Causal Inference Engines
Authors: Daniel Jiwoong Im, Kevin Zhang, Nakul Verma, Kyunghyun Cho
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct empirical studies across a variety of exemplar tasks. We evaluate our method across three environments: a maze setting for navigational decision-making, a chess environment analyzing strategic moves in king vs. king-rook endgames, and the PeerRead dataset, which examines the impact of theorem presence on academic paper acceptance. |
| Researcher Affiliation | Academia | Daniel Jiwoong Im (Center for Data Science, New York University); Kevin Zhang (Computer Science, Columbia University); Nakul Verma (Computer Science, Columbia University); Kyunghyun Cho (Center for Data Science, New York University) |
| Pseudocode | No | The paper describes its methodology using mathematical formulations and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, or any structured, code-like procedural descriptions. |
| Open Source Code | Yes | Code is available at https://github.com/jiwoongim/Deep-Autoregressive-Models-as-Causal-Inference-Engines. |
| Open Datasets | Yes | We use the PeerRead dataset (Kang et al., 2018) to estimate causal effects in a semi-realistic setting with high-dimensional text confounders. Additionally, we evaluate our model on a semi-synthetic baseline using the Infant Health and Development Program (IHDP) data. |
| Dataset Splits | Yes | For the training dataset, we sample 500,000 two-move chess games per dataset based on Black's policy function. The testing dataset consists of all 223,660 valid starting positions. We report the mean and standard error of the ATE error on the test set across 30 different random train-test splits in Table 5. |
| Hardware Specification | Yes | The maze experiments were conducted on a single NVIDIA Tesla T4. All models for the chess and PeerRead experiments were trained on a single NVIDIA GeForce RTX 3090 in four and eight hours respectively. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer', 'vanilla transformer', 'BERT', 'GPT', and 'Stockfish' but does not provide specific version numbers for any of these software libraries or frameworks. While links to pre-trained BERT and GPT model checkpoints are provided, the versions of the underlying software dependencies for replication are not specified. |
| Experiment Setup | Yes | The paper specifies a separate training recipe per experiment: (1) the Adam optimizer with a batch size of 64, training the CI model for 6,250 iterations and the offline RL model for 5,000 iterations; (2) 200 epochs with the Adam optimizer, a batch size of 4,096, and a learning rate chosen to be as large as possible without overfitting; (3) 100 epochs with the Adam optimizer, a batch size of 16, with the learning rate likewise set as high as possible without overfitting. |
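The Dataset Splits row reports the mean and standard error of the ATE error over 30 random train-test splits. A minimal sketch of that aggregation is below; the per-split error values are placeholders drawn at random, not results from the paper:

```python
import numpy as np

# Placeholder per-split ATE errors: 30 values, one per random
# train-test split, as in the paper's Table 5 protocol.
rng = np.random.default_rng(0)
ate_errors = rng.normal(loc=0.05, scale=0.01, size=30)

mean_err = ate_errors.mean()
# Standard error of the mean = sample standard deviation / sqrt(n)
std_err = ate_errors.std(ddof=1) / np.sqrt(len(ate_errors))

print(f"ATE error: {mean_err:.4f} +/- {std_err:.4f}")
```

Reporting the standard error (rather than the raw standard deviation) expresses uncertainty about the mean itself, which shrinks as more splits are added.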
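The Experiment Setup row quotes one recipe using Adam with mini-batches of 64. As a self-contained sketch of what such a loop looks like, the toy example below trains a linear regression with a hand-rolled Adam update; the task, data, and learning rate are illustrative assumptions, not the paper's models or settings:

```python
import numpy as np

# Toy data: linear regression with mild noise (placeholder task,
# not one of the paper's environments).
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))
true_w = rng.normal(size=8)
y = X @ true_w + 0.01 * rng.normal(size=1024)

# Adam state and hyperparameters (standard defaults; lr is a guess).
w = np.zeros(8)
m, v = np.zeros(8), np.zeros(8)
lr, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8
batch_size = 64  # as quoted in the Experiment Setup row

t = 0
for epoch in range(50):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        # Gradient of mean squared error on the mini-batch.
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        t += 1
        # Adam moment updates with bias correction.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print("final MSE:", float(np.mean((X @ w - y) ** 2)))
```

The bias-correction terms `m_hat` and `v_hat` matter early in training, when the exponential moving averages are still biased toward their zero initialization.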