Generalization to New Actions in Reinforcement Learning
Authors: Ayush Jain, Andrew Szot, Joseph Lim
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark generalization on sequential tasks, such as selecting from an unseen tool-set to solve physical reasoning puzzles and stacking towers with novel 3D shapes. Videos and code are available at https://sites.google.com/view/action-generalization. Our main contribution is introducing the problem of generalization to new actions. We propose four new environments to benchmark this setting. Our experiments aim to answer the following questions about the proposed problem and framework: (1) Can the HVAE extract meaningful action characteristics from the action observations? (2) What are the contributions of the proposed action encoder, policy architecture, and regularizations for generalization to new actions? (3) How well does our framework generalize to varying difficulties of test actions and types of action observations? (4) How inefficient is finetuning to a new action space as compared to zero-shot generalization? |
| Researcher Affiliation | Academia | Ayush Jain*, Andrew Szot*, Joseph J. Lim, Department of Computer Science, University of Southern California, California, USA. Correspondence to: Ayush Jain <ayushj@usc.edu>, Andrew Szot <szot@usc.edu>. |
| Pseudocode | Yes | Algorithm 1. Two-stage Training Framework. Algorithm 2. Generalization to New Actions. |
| Open Source Code | Yes | Videos and code are available at https://sites.google.com/view/action-generalization. Complete code available at https://github.com/clvrai/new-actions-rl |
| Open Datasets | Yes | We propose four sequential decision-making environments with diverse actions to evaluate and benchmark the proposed problem of generalization to new actions. In each environment, the train-test-validation split is approximately 50-25-25%. Complete details on each environment, action observations, and train-validation-test splits can be found in Appendix A. CREATE environment: https://clvrai.com/create. Complete code available at https://github.com/clvrai/new-actions-rl |
| Dataset Splits | Yes | In each environment, the train-test-validation split is approximately 50-25-25%. Validation-based model selection: During training, the models are evaluated on held-out validation sets of actions, and the best performing model is selected. Perform hyperparameter tuning and model selection by evaluating on the validation action set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions the optimizers RAdam and Adam, the RL algorithm PPO, and networks such as a Bi-LSTM, but does not provide specific version numbers for any libraries, programming languages (e.g., Python), or software environments. |
| Experiment Setup | Yes | Subsampled action spaces: To limit the actions available in each episode of training, we randomly subsample action subsets of A of size m, a hyperparameter. Maximum entropy regularization: We further diversify the policy's actions during training using the maximum entropy objective (Ziebart et al., 2008). We add the entropy of the policy H[πθ(a\|s)] to the RL objective with a hyperparameter weighting β. Note that the validation set is also used to tune hyperparameters such as the entropy coefficient β and the subsampled action set size m. The HVAE is trained using the RAdam optimizer (Liu et al., 2019), and we use PPO (Schulman et al., 2017) to train the policy with the Adam optimizer (Kingma & Ba, 2015). Additional implementation and experimental details, including the hyperparameters searched, are provided in Appendix D. |
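
To make the quoted experiment-setup details concrete, the sketch below illustrates per-episode action-set subsampling and an entropy bonus H[πθ(a|s)] weighted by β. This is a minimal sketch, not the authors' released code (see https://github.com/clvrai/new-actions-rl); the function names `sample_action_subset` and `entropy_bonus` and the NumPy-based discrete policy are assumptions for illustration.

```python
import numpy as np

# Minimal sketch (not the authors' code): per-episode action-set subsampling
# and an entropy-regularized term added to the RL objective.

def sample_action_subset(train_actions, m, rng=None):
    """Randomly subsample m actions from the available training actions."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(train_actions), size=m, replace=False)
    return [train_actions[i] for i in idx]

def entropy_bonus(action_probs, beta):
    """Entropy H[pi(a|s)] of a discrete policy, weighted by coefficient beta."""
    eps = 1e-8
    entropy = -np.sum(action_probs * np.log(action_probs + eps), axis=-1)
    return beta * entropy

# Example: subsample a 10-action set and compute the entropy bonus for a
# uniform policy over it (beta and m are tuned on the validation actions).
subset = sample_action_subset(list(range(50)), m=10)
probs = np.full(len(subset), 1.0 / len(subset))
print(entropy_bonus(probs, beta=0.01))
```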
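Similarly, the approximate 50-25-25% split of the action set into training, validation, and test actions can be sketched as follows. The `split_actions` helper and its arguments are hypothetical, intended only to illustrate the described protocol of validation-based model selection and held-out test actions.

```python
import random

# Minimal sketch of an approximate 50-25-25% train/validation/test split
# over actions, as described in the paper. Not the authors' preprocessing.

def split_actions(actions, ratios=(0.5, 0.25, 0.25), seed=0):
    """Shuffle actions and split them into train/validation/test sets."""
    rng = random.Random(seed)
    shuffled = actions[:]
    rng.shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_val = int(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# The validation actions are used for model selection and for tuning
# hyperparameters such as the entropy coefficient and subsampled set size;
# the test actions are held out for zero-shot generalization evaluation.
train_a, val_a, test_a = split_actions(list(range(100)))
print(len(train_a), len(val_a), len(test_a))  # 50 25 25
```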