Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory
Authors: Alexander Levine, Peter Stone, Amy Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that STEEL is correct and sample-efficient, and demonstrate STEEL on two toy problems. Code is available at: https://github.com/midi-lab/steel. SIMULATION EXPERIMENTS We test the STEEL algorithm on two toy problems: an infinite-horizon environment inspired by the combination lock environment from Efroni et al. (2022b), and a version of the multi-maze environment from Lamb et al. (2023). ... Results are shown in Table 2. |
| Researcher Affiliation | Collaboration | Alexander Levine (The University of Texas at Austin); Peter Stone (The University of Texas at Austin, Sony AI); Amy Zhang (The University of Texas at Austin) |
| Pseudocode | Yes | A FULL ALGORITHM The STEEL algorithm is presented here in full as Algorithm 1, with a major subroutine, Cycle Find, split out as Algorithm 2. |
| Open Source Code | Yes | Code is available at: https://github.com/midi-lab/steel. |
| Open Datasets | No | We test the STEEL algorithm on two toy problems: an infinite-horizon environment inspired by the combination lock environment from Efroni et al. (2022b), and a version of the multi-maze environment from Lamb et al. (2023). |
| Dataset Splits | No | When assessing the minimum-range minimal loss encoder as in Equation 140, Levine et al. (2024) include an empirical fudge factor: their protocol returns the minimum-range encoder that achieves a loss within 0.1% of the true minimum loss over all possible encoders. We found that even without this fudge factor, the multistep-inverse methods were still heavily biased towards returning ϕSR, when applied either to Single Prime(p, q); or to Double Prime(p, q) with too-small K or too-few samples. ... To better match the spirit of learning an Ex-BMDP from a single trajectory, we collect the fitting and optimization sets from one single trajectory, one after the other, with each corresponding to half of the trajectory: we regard the total sample complexity as the length of the entire trajectory. |
| Hardware Specification | No | The paper describes simulation experiments but does not provide specific hardware details such as GPU/CPU models or other computing infrastructure used for running these simulations. |
| Software Dependencies | No | The paper does not mention specific software dependencies with version numbers, such as programming languages or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For all experiments, we set δ = ϵ = .05. For the combination lock experiments, we set L = 512, and use the (intentionally loose) upper bounds N = D̂ = K + 10 (= \|S\| + 10) and t̂_mix = 40. For the multi-maze environment, we use N = D̂ = 80 (> \|S\| = 68), and t̂_mix = 300. See Appendix D for how we chose the (loose) bounds t̂_mix ≥ t_mix. For STEEL, we set N = D̂ = 2q (to match the maximum number of states in Double Prime(p, q) or Single Prime(p, q)); t̂_mix = 1 (because neither Double Prime(p, q) nor Single Prime(p, q) have time-correlated noise); ϵ = 0.49 (because, due to the simple emission distributions Q of Single Prime and Double Prime, a 51% encoder accuracy on each latent state implies a 100% encoder accuracy); and δ = 0.05. |
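
The settings quoted in the Experiment Setup and Dataset Splits rows can be collected in one place, which may help when attempting a reproduction. The following is a minimal Python sketch, not code from the STEEL repository: the class `SteelSettings`, its field names, and all helper functions are hypothetical, and the order of the two halves in `split_single_trajectory` (fitting first, optimization second) is an assumption based on the phrase "one after the other". Only the numeric values come from the quoted text.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# L = 512 is quoted for the combination lock; its meaning is defined in the paper.
COMBINATION_LOCK_L = 512


@dataclass(frozen=True)
class SteelSettings:
    """Hypothetical container for the hyperparameters quoted in the table."""
    n_bound: int            # N, a (loose) upper bound related to |S|
    d_hat: int              # D-hat, set equal to N in every quoted setup
    t_mix_hat: int          # t-hat_mix, a (loose) upper bound on t_mix
    epsilon: float = 0.05   # default quoted for all main experiments
    delta: float = 0.05


def combination_lock_settings(k: int) -> SteelSettings:
    # Quoted: N = D-hat = K + 10 and t-hat_mix = 40.
    return SteelSettings(n_bound=k + 10, d_hat=k + 10, t_mix_hat=40)


def multi_maze_settings() -> SteelSettings:
    # Quoted: N = D-hat = 80 (> |S| = 68) and t-hat_mix = 300.
    return SteelSettings(n_bound=80, d_hat=80, t_mix_hat=300)


def prime_settings(q: int) -> SteelSettings:
    # Quoted for Single Prime / Double Prime: N = D-hat = 2q, t-hat_mix = 1,
    # epsilon = 0.49, delta = 0.05.
    return SteelSettings(n_bound=2 * q, d_hat=2 * q, t_mix_hat=1,
                         epsilon=0.49, delta=0.05)


def split_single_trajectory(trajectory: Sequence[T]) -> Tuple[List[T], List[T]]:
    """Split one trajectory into two consecutive halves.

    Assumption: the first half is the fitting set and the second half is the
    optimization set. The sample complexity is the full trajectory length.
    """
    half = len(trajectory) // 2
    return list(trajectory[:half]), list(trajectory[half:])
```

For example, `combination_lock_settings(20)` yields bounds of 30 for both N and D̂, and `split_single_trajectory(list(range(10)))` returns the first five and last five steps as the two sets.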