Predicting Future Actions of Reinforcement Learning Agents
Authors: Stephen Chung, Scott Niekum, David Krueger
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper experimentally evaluates and compares the effectiveness of future action and event prediction for three types of RL agents: explicitly planning, implicitly planning, and non-planning. We conduct extensive experiments to address the above research questions. |
| Researcher Affiliation | Academia | Stephen Chung (University of Cambridge); Scott Niekum (University of Massachusetts Amherst); David Krueger (Mila) |
| Pseudocode | No | The paper describes algorithms and methods but does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like formatting. |
| Open Source Code | Yes | Full code is available at https://github.com/stephen-chung-mh/predict_action. Code: The code used for these experiments is available at https://github.com/stephen-chung-mh/predict_action and is based on the public code released in Thinker [6]. |
| Open Datasets | No | The paper uses the Sokoban environment and describes how data was generated (50k transitions). While Sokoban is a known game, there is no specific URL, DOI, or citation provided for a pre-existing, publicly accessible dataset of Sokoban levels or generated transitions used for training. |
| Dataset Splits | Yes | We generate 50,000 training samples, 10,000 evaluation samples, and 10,000 testing samples using the trained agents. |
| Hardware Specification | Yes | Computational Resources: Each agent is trained using a single A100 GPU, with training time varying by algorithm. |
| Software Dependencies | No | The paper mentions general components like 'convolutional network', 'three-layer Transformer encoder', and 'Adam optimizer', but does not provide specific version numbers for software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used. |
| Experiment Setup | Yes | We evaluate the performance of predictors with varying training data sizes: 1k, 2k, 5k, 10k, 20k, 50k. We utilize a batch size of 128 and an Adam optimizer with a learning rate of 0.0001. Training is halted when the validation loss fails to improve for 10 consecutive steps. |
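The early-stopping criterion quoted in the Experiment Setup row (halt when validation loss fails to improve for 10 consecutive steps) can be sketched as a small helper. This is an illustrative sketch, not the authors' code: the `EarlyStopping` class and the simulated loss sequence are hypothetical, and only the hyperparameters (batch size 128, Adam learning rate 0.0001, patience 10) come from the paper.

```python
# Hypothetical sketch of the training-loop stopping rule described in the
# Experiment Setup row. The hyperparameter values are from the paper; the
# helper class and loss sequence below are illustrative only.

BATCH_SIZE = 128       # batch size reported in the paper
LEARNING_RATE = 1e-4   # Adam learning rate reported in the paper
PATIENCE = 10          # validation checks without improvement before halting

class EarlyStopping:
    """Stop training once validation loss has not improved for `patience` checks."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, val_loss: float) -> bool:
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience

# Example: validation loss plateaus after an initial drop, so stopping
# triggers 10 checks after the last improvement.
stopper = EarlyStopping(PATIENCE)
losses = [1.0, 0.8, 0.7] + [0.7] * 15
stopped_at = next(i for i, loss in enumerate(losses) if stopper.step(loss))
```

In this example the last improvement occurs at index 2, so the rule fires at index 12, after 10 non-improving checks.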