Reinforcement Learning of Implicit and Explicit Control Flow Instructions

Authors: Ethan Brooks, Janarthanan Rajendran, Richard L Lewis, Satinder Singh

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We test the architecture's ability to learn both explicit and implicit control in two illustrative domains, one inspired by Minecraft and the other by StarCraft, and show that the architecture exhibits zero-shot generalization to novel instructions of length greater than those in a training set, at a performance level unmatched by three baseline recurrent architectures and one ablation architecture." "4. Experiments: In this section we present results from three generalization experiments in two instruction-following domains, in which agents are trained on short instructions and evaluated on longer instructions or on instructions containing unseen combinations of explicit control-flow blocks."
Researcher Affiliation | Academia | (1) Department of Computer Science, University of Michigan; (2) Weinberg Institute for Cognitive Science, Departments of Psychology and Linguistics, University of Michigan.
Pseudocode | Yes | "Figure 1. (Left) Depicts the flow of information at every time step from memory M, pointer p_t, and observation x_t to actions a_t and pointer movements d_t. (Right) Pointer update pseudocode." (A hedged sketch of such a pointer update appears after this table.)
Open Source Code | Yes | "Source code may be accessed from https://github.com/ethanabrooks/CoFCA-S"
Open Datasets | No | The paper describes how the environment and instructions are generated for the StarCraft-inspired and Minecraft-inspired domains ("randomly generated for each episode", "randomly sampled from a generative grammar"), but it does not provide concrete access information (link, DOI, formal citation) for a pre-existing or released public dataset used for training or evaluation; the data is generated dynamically. (A toy sketch of grammar-based instruction sampling appears after this table.)
Dataset Splits | No | The paper describes evaluation periods ("Every million frames, we evaluated agent performance on 150 complete episodes") and hyperparameter tuning, but it does not explicitly provide training/validation/test dataset splits with percentages or sample counts, or refer to predefined splits for a fixed dataset, as the environments are dynamically generated.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments, only general statements about training environments.
Software Dependencies | No | The paper mentions algorithms and network components (e.g., PPO, GRU, ReLU) but does not list specific software dependencies or library versions (e.g., PyTorch, TensorFlow, or Python versions) needed for replication.
Experiment Setup | No | The paper mentions that hyperparameters such as hidden size, kernel size, stride, entropy coefficient, number of distributions L, and learning rate were tuned, but it does not provide their specific values or detailed configuration settings for the experimental setup in the main text.
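
The Pseudocode row refers to the pointer-update routine shown in Figure 1 of the paper. The exact pseudocode lives in that figure; the following is only a minimal sketch of the general mechanism it describes, assuming the agent emits a categorical distribution over signed pointer shifts d_t that is sampled and applied to the current pointer p_t over the instruction lines held in memory M. The shape of `shift_logits`, the shift range, and the clamping to valid lines are illustrative assumptions, not the paper's implementation.

```python
import torch


def pointer_update(shift_logits: torch.Tensor, p_t: int, num_lines: int) -> int:
    """Toy pointer update: sample a signed shift d_t and move the pointer.

    shift_logits: unnormalized scores over candidate shifts, assumed to cover
                  the range [-max_shift, ..., +max_shift] (an assumption, not
                  the paper's exact parameterization).
    p_t:          current pointer into the instruction lines stored in memory M.
    num_lines:    number of instruction lines in M.
    """
    max_shift = (shift_logits.numel() - 1) // 2
    # Sample a shift index from the categorical distribution over shifts.
    dist = torch.distributions.Categorical(logits=shift_logits)
    d_t = dist.sample().item() - max_shift  # map index -> signed shift
    # Apply the shift and keep the pointer on a valid instruction line.
    return max(0, min(num_lines - 1, p_t + d_t))
```

In the paper, the observation x_t, the attended memory entry, and the recurrent state jointly produce both the action a_t and the pointer movement d_t at every time step; the sketch above isolates only the pointer-movement half of that loop.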
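
The Open Datasets row notes that instructions are "randomly sampled from a generative grammar" rather than drawn from a released dataset. As a purely illustrative example of what such per-episode generation can look like, here is a toy recursive sampler over if/while/subtask blocks; the terminals, probabilities, and recursion depth are hypothetical and are not the grammar used in the paper.

```python
import random

# Hypothetical terminals; the paper's domains define their own subtasks and conditions.
SUBTASKS = ["mine wood", "build bridge", "attack unit"]
CONDITIONS = ["enemy visible", "resource low"]


def sample_instruction(depth: int = 0, max_depth: int = 2) -> list:
    """Recursively sample a small instruction as a list of lines."""
    lines = []
    for _ in range(random.randint(1, 3)):
        roll = random.random()
        if depth < max_depth and roll < 0.2:
            # Explicit control flow: an if-block wrapping a nested instruction.
            lines.append(f"if {random.choice(CONDITIONS)}:")
            lines += ["  " + line for line in sample_instruction(depth + 1, max_depth)]
        elif depth < max_depth and roll < 0.35:
            # Explicit control flow: a while-block wrapping a nested instruction.
            lines.append(f"while {random.choice(CONDITIONS)}:")
            lines += ["  " + line for line in sample_instruction(depth + 1, max_depth)]
        else:
            lines.append(random.choice(SUBTASKS))
    return lines


# Example: a fresh instruction sampled for each episode, as the paper describes.
print("\n".join(sample_instruction()))
```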