Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Predicting Future Actions of Reinforcement Learning Agents
Authors: Stephen Chung, Scott Niekum, David Krueger
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper experimentally evaluates and compares the effectiveness of future action and event prediction for three types of RL agents: explicitly planning, implicitly planning, and non-planning. We conduct extensive experiments to address the above research questions. |
| Researcher Affiliation | Academia | Stephen Chung (University of Cambridge); Scott Niekum (University of Massachusetts Amherst); David Krueger (Mila) |
| Pseudocode | No | The paper describes algorithms and methods but does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like formatting. |
| Open Source Code | Yes | Full code is available at https://github.com/stephen-chung-mh/predict_action. Code: The code used for these experiments is available at https://github.com/stephen-chung-mh/predict_action and is based on the public code released in Thinker [6]. |
| Open Datasets | No | The paper uses the Sokoban environment and describes how data was generated (50k transitions). While Sokoban is a known game, there is no specific URL, DOI, or citation provided for a pre-existing, publicly accessible dataset of Sokoban levels or generated transitions used for training. |
| Dataset Splits | Yes | We generate 50,000 training samples, 10,000 evaluation samples, and 10,000 testing samples using the trained agents. |
| Hardware Specification | Yes | Computational Resources: Each agent is trained using a single A100 GPU, with training time varying by algorithm. |
| Software Dependencies | No | The paper mentions general components like 'convolutional network', 'three-layer Transformer encoder', and 'Adam optimizer', but does not provide specific version numbers for software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used. |
| Experiment Setup | Yes | We evaluate the performance of predictors with varying training data sizes: 1k, 2k, 5k, 10k, 20k, 50k. We utilize a batch size of 128 and an Adam optimizer with a learning rate of 0.0001. Training is halted when the validation loss fails to improve for 10 consecutive steps. |
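
The stopping rule quoted in the Experiment Setup row can be illustrated with a minimal, framework-free sketch. It is not taken from the authors' repository. The names `EarlyStopper` and `stopping_step` are hypothetical, the constants simply restate the values quoted in the table, and two details are assumptions: that only a strict decrease in validation loss counts as an improvement, and that one "step" means one validation check.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Values quoted in the Experiment Setup row; how the paper's code wires them
# together is not shown here.
TRAIN_SIZES = [1_000, 2_000, 5_000, 10_000, 20_000, 50_000]
BATCH_SIZE = 128
LEARNING_RATE = 1e-4
PATIENCE = 10  # consecutive non-improving validation checks before halting


@dataclass
class EarlyStopper:
    """Hypothetical helper that tracks the best validation loss seen so far."""

    patience: int = PATIENCE
    best: float = float("inf")
    bad_checks: int = 0

    def update(self, val_loss: float) -> bool:
        """Record one validation loss; return True when training should halt."""
        if val_loss < self.best:  # assumption: ties do not count as improvement
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


def stopping_step(val_losses: Iterable[float], patience: int = PATIENCE) -> Optional[int]:
    """Return the 0-based index of the check at which training halts, or None."""
    stopper = EarlyStopper(patience=patience)
    for step, loss in enumerate(val_losses):
        if stopper.update(loss):
            return step
    return None
```

In a real training loop, `update` would be called once per validation pass, and the loop would break as soon as it returns `True`.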