Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Contrastive Representations for Temporal Reasoning

Authors: Alicja Ziarko, Michał Bortkiewicz, Michał Zawalski, Benjamin Eysenbach, Piotr Miłoś

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments aim to answer the following specific research questions: 1. Does learning representations that ignore context improve performance on combinatorial reasoning problems? (Sec. 5.2) 2. Do learned representations alone suffice for reasoning, or is explicit search essential? (Sec. 5.3) 3. Are representation learning methods that remove context competitive with successful prior methods for combinatorial reasoning? (Sec. 5.2) 4. What is the relative importance of design decisions, such as how the negatives are sampled and the number of in-trajectory negatives? (Sec. 5.4)
Researcher Affiliation	Collaboration	Alicja Ziarko 1,2,3; Michał Bortkiewicz 4; Michał Zawalski 1,6; Benjamin Eysenbach 5; Piotr Miłoś 1,3,*. Affiliations: 1 University of Warsaw; 2 IDEAS NCBR; 3 IMPAN; 4 Warsaw University of Technology; 5 Princeton University; 6 NVIDIA
Pseudocode	Yes	Algorithm 1 CRTR performs temporal contrastive learning, but samples negatives in a different way so that representations discard task-irrelevant context, boosting performance (See Fig. 2). [...] Algorithm 2 Best-First Search [29]
Open Source Code	Yes	Code to reproduce our experiments is available online: https://github.com/Princeton-RL/CRTR.
Open Datasets	Yes	For Sokoban, we use trajectories provided by Czechowski et al. [14] [...] To investigate whether CRTR also identifies temporal features in non-combinatorial domains, we apply it to a dataset of robotic manipulation trajectories (the Adroit dataset from D4RL [23]).
Dataset Splits	Yes	For Sokoban, we construct a separate test set comprising 100 trajectories, which is used to compute evaluation metrics such as accuracy, correlation, and t-SNE visualizations. For all other environments, a separate test set is unnecessary, as we train for only a single epoch.
Hardware Specification	Yes	All training experiments were conducted using NVIDIA A100 GPUs and took between 5 and 48 hours each. The solving runs ranged from 10 minutes to 10 hours. In total, the project required approximately 30,000 GPU hours to complete.
Software Dependencies	No	The paper mentions "Adam optimizer" but does not specify its version. It also mentions "NPEET package" but without a version number. No other specific software versions are provided.
Experiment Setup	Yes	We use the Adam optimizer with a constant learning rate throughout training. A learning rate of 0.0003 was found to perform well across all environments, with the exception of Lights Out, where this setting led to unstable training. For this environment, we instead use a reduced learning rate of 0.0001. In all environments, we use a batch size of 512. [...] We adopt the network architecture proposed by Nauman et al. [45], using 8 layers with a hidden size of 512 and a representation dimension of 64. [...] We set the temperature parameter in the contrastive loss to the square root of the representation dimension.
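
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, shown here as a minimal sketch rather than the authors' code. The names `TrainConfig` and `info_nce` are hypothetical. The loss is a generic InfoNCE with in-batch negatives and dot-product similarity, both of which are assumptions; it does not reproduce CRTR's negative sampling scheme (Algorithm 1) or the similarity measure used in the paper. Only the numeric values and the temperature rule come from the row above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup row; field names are illustrative."""
    learning_rate: float = 3e-4  # the row reports 1e-4 for Lights Out
    batch_size: int = 512
    num_layers: int = 8
    hidden_size: int = 512
    repr_dim: int = 64

    @property
    def temperature(self) -> float:
        # Reported rule: square root of the representation dimension.
        return math.sqrt(self.repr_dim)


def info_nce(anchors, positives, temperature):
    """Mean InfoNCE loss over a batch (generic form, assumed for illustration).

    anchors[i] and positives[i] form the positive pair; every other
    positives[j] in the batch serves as a negative for anchors[i].
    """
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature
                  for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


cfg = TrainConfig()
lights_out_cfg = TrainConfig(learning_rate=1e-4)
```

With a representation dimension of 64, `cfg.temperature` evaluates to 8.0, so similarity scores are divided by 8 before the softmax.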