Blocks Assemble! Learning to Assemble with Large-Scale Structured Reinforcement Learning

Authors: Seyed Kamyar Seyed Ghasemipour, Satoshi Kataoka, Byron David, Daniel Freeman, Shixiang Shane Gu, Igor Mordatch

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments, we highlight the importance of large-scale training, structured representations, contributions of multi-task vs. single-task learning, as well as the effects of curriculums, and discuss qualitative behaviors of trained agents."
Researcher Affiliation | Industry | "Google Research. Correspondence to: Seyed Kamyar Seyed Ghasemipour <kamyar@google.com>."
Pseudocode | Yes | "Appendix C. Graph Neural Network Architecture." (Contains Python code blocks for the network architecture; a hedged layer sketch follows the table.)
Open Source Code | No | "Our accompanying project webpage can be found at: sites.google.com/view/learning-direct-assembly." (Upon visiting the link, it states 'Code coming soon!', indicating the code was not yet publicly available at the time of checking.)
Open Datasets | No | "To specify the assembly task, we designed 165 blueprints (split into 141 train, 24 test) describing interesting structures to be built..." (The authors designed the blueprints themselves, but the paper provides no link or citation for public access to them.)
Dataset Splits | No | "To specify the assembly task, we designed 165 blueprints (split into 141 train, 24 test)..." (Train and test splits are given, but no explicit validation split is described.)
Hardware Specification | Yes | "Unless otherwise specified, our agents are trained for 1 Billion environment timesteps, using 1 Nvidia V100 GPU for training, and 3000 preemptible CPUs for generating rollouts in the environment."
Software Dependencies | No | "The key libraries used for training are Jax (Bradbury et al., 2018), Jraph (Godwin* et al., 2020), Haiku (Hennigan et al., 2020), and Acme (Hoffman et al., 2020)." (Library names are cited, but no version numbers are provided; a version-recording sketch follows the table.)
Experiment Setup | Yes | "We train our agents using Proximal Policy Optimization (PPO) (Schulman et al., 2017) and Generalized Advantage Estimation (GAE) (Schulman et al., 2015), and follow the practical PPO training advice of (Andrychowicz et al., 2020a). ... Unless otherwise specified, our agents are trained for 1 Billion environment timesteps... Episodes are 100 environment steps long... Unless otherwise specified, we reset from training blueprints with probability 0.2." (Appendix C also details network hyperparameters such as NUM_GN_GAT_LAYERS = 3 and VALUE_MLP_LAYER_SIZES = [512, 512, 512]; a GAE sketch follows the table.)
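
To give a concrete picture of the kind of architecture the Pseudocode row points to, the snippet below is a minimal, hypothetical Jraph/Haiku message-passing layer, not the paper's Appendix C code. The feature sizes, toy graph, and two-MLP update scheme are placeholder assumptions; the actual network uses graph-attention layers with the hyperparameters quoted above (e.g. NUM_GN_GAT_LAYERS = 3).

import haiku as hk
import jax
import jax.numpy as jnp
import jraph


def gn_layer(graph: jraph.GraphsTuple) -> jraph.GraphsTuple:
    """One message-passing step: update edges from their endpoints, then nodes from incoming edges."""
    def update_edge_fn(edges, sender_nodes, receiver_nodes, globals_):
        del globals_  # unused in this sketch
        return hk.nets.MLP([128, 128])(
            jnp.concatenate([edges, sender_nodes, receiver_nodes], axis=-1))

    def update_node_fn(nodes, sent_edges, received_edges, globals_):
        del sent_edges, globals_  # unused in this sketch
        return hk.nets.MLP([128, 128])(
            jnp.concatenate([nodes, received_edges], axis=-1))

    return jraph.GraphNetwork(update_edge_fn=update_edge_fn,
                              update_node_fn=update_node_fn)(graph)


# Toy graph with placeholder feature sizes: 3 nodes, 2 directed edges.
graph = jraph.GraphsTuple(
    nodes=jnp.ones((3, 8)), edges=jnp.ones((2, 4)),
    senders=jnp.array([0, 1]), receivers=jnp.array([1, 2]),
    globals=None, n_node=jnp.array([3]), n_edge=jnp.array([2]))

forward = hk.without_apply_rng(hk.transform(gn_layer))
params = forward.init(jax.random.PRNGKey(0), graph)
out = forward.apply(params, graph)  # GraphsTuple with updated node/edge features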
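
Because the Software Dependencies row notes that no version numbers are reported, a reproduction would need to pin the libraries itself. One simple way to record whatever versions end up installed is sketched below (not every release exposes a version attribute, hence the fallback):

import jax, jraph, haiku, acme

for lib in (jax, jraph, haiku, acme):
    # Print the module name and its version string, if the package exposes one.
    print(lib.__name__, getattr(lib, "__version__", "version attribute not found"))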
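
The Experiment Setup row cites PPO with Generalized Advantage Estimation. As a reference for the estimator itself, not the paper's implementation, here is a minimal NumPy sketch of GAE; the discount gamma and mixing parameter lambda shown are common defaults, not values reported by the paper.

import numpy as np


def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute GAE advantages over one rollout.

    rewards, dones: length-T arrays; values: length-(T+1) array that includes
    the bootstrap value for the state after the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # One-step TD error, zeroing the bootstrap at episode boundaries.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages  # value targets are advantages + values[:-1]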