Generalization to New Actions in Reinforcement Learning
Authors: Ayush Jain, Andrew Szot, Joseph Lim
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark generalization on sequential tasks, such as selecting from an unseen tool-set to solve physical reasoning puzzles and stacking towers with novel 3D shapes. Videos and code are available at https://sites.google.com/view/action-generalization. Our main contribution is introducing the problem of generalization to new actions. We propose four new environments to benchmark this setting. Our experiments aim to answer the following questions about the proposed problem and framework: (1) Can the HVAE extract meaningful action characteristics from the action observations? (2) What are the contributions of the proposed action encoder, policy architecture, and regularizations for generalization to new actions? (3) How well does our framework generalize to varying difficulties of test actions and types of action observations? (4) How inefficient is finetuning to a new action space as compared to zero-shot generalization? |
| Researcher Affiliation | Academia | Ayush Jain*, Andrew Szot*, Joseph J. Lim, Department of Computer Science, University of Southern California, California, USA. Correspondence to: Ayush Jain <ayushj@usc.edu>, Andrew Szot <szot@usc.edu>. |
| Pseudocode | Yes | Algorithm 1. Two-stage Training Framework. Algorithm 2. Generalization to New Actions. |
| Open Source Code | Yes | Videos and code are available at https://sites.google.com/view/action-generalization. Complete code available at https://github.com/clvrai/new-actions-rl |
| Open Datasets | Yes | We propose four sequential decision-making environments with diverse actions to evaluate and benchmark the proposed problem of generalization to new actions. In each environment, the train-test-validation split is approximately 50-25-25%. Complete details on each environment, action observations, and train-validation-test splits can be found in Appendix A. CREATE environment: https://clvrai.com/create. Complete code available at https://github.com/clvrai/new-actions-rl |
| Dataset Splits | Yes | In each environment, the train-test-validation split is approximately 50-25-25%. Validation-based model selection: During training, the models are evaluated on held-out validation sets of actions, and the best performing model is selected. Perform hyperparameter tuning and model selection by evaluating on the validation action set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions the optimizers RAdam and Adam, the RL algorithm PPO, and networks such as a Bi-LSTM, but does not provide specific version numbers for any libraries, programming languages (e.g., Python), or software environments. |
| Experiment Setup | Yes | Subsampled action spaces: To limit the actions available in each episode of training, we randomly subsample action subsets of A of size m, a hyperparameter. Maximum entropy regularization: We further diversify the policy's actions during training using the maximum entropy objective (Ziebart et al., 2008). We add the entropy of the policy H[πθ(a\|s)] to the RL objective with a hyperparameter weighting β. Note that the validation set is also used to tune hyperparameters such as the entropy coefficient β and the subsampled action set size m. The HVAE is trained using the RAdam optimizer (Liu et al., 2019), and we use PPO (Schulman et al., 2017) to train the policy with the Adam optimizer (Kingma & Ba, 2015). Additional implementation and experimental details, including the hyperparameters searched, are provided in Appendix D. |
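
To make the quoted experiment-setup details concrete, the sketch below illustrates per-episode action-set subsampling and an entropy bonus H[πθ(a|s)] weighted by β. This is a minimal sketch, not the authors' released code (see https://github.com/clvrai/new-actions-rl); the function names `sample_action_subset` and `entropy_bonus` and the NumPy-based discrete policy are assumptions for illustration.

```python
import numpy as np

# Minimal sketch (not the authors' code): per-episode action-set subsampling
# and an entropy-regularized term added to the RL objective.

def sample_action_subset(train_actions, m, rng=None):
    """Randomly subsample m actions from the available training actions."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(train_actions), size=m, replace=False)
    return [train_actions[i] for i in idx]

def entropy_bonus(action_probs, beta):
    """Entropy H[pi(a|s)] of a discrete policy, weighted by coefficient beta."""
    eps = 1e-8
    entropy = -np.sum(action_probs * np.log(action_probs + eps), axis=-1)
    return beta * entropy

# Example: subsample a 10-action set and compute the entropy bonus for a
# uniform policy over it (beta and m are tuned on the validation actions).
subset = sample_action_subset(list(range(50)), m=10)
probs = np.full(len(subset), 1.0 / len(subset))
print(entropy_bonus(probs, beta=0.01))
```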
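Similarly, the approximate 50-25-25% split of the action set into training, validation, and test actions can be sketched as follows. The `split_actions` helper and its arguments are hypothetical, intended only to illustrate the described protocol of validation-based model selection and held-out test actions.

```python
import random

# Minimal sketch of an approximate 50-25-25% train/validation/test split
# over actions, as described in the paper. Not the authors' preprocessing.

def split_actions(actions, ratios=(0.5, 0.25, 0.25), seed=0):
    """Shuffle actions and split them into train/validation/test sets."""
    rng = random.Random(seed)
    shuffled = actions[:]
    rng.shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_val = int(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# The validation actions are used for model selection and for tuning
# hyperparameters such as the entropy coefficient and subsampled set size;
# the test actions are held out for zero-shot generalization evaluation.
train_a, val_a, test_a = split_actions(list(range(100)))
print(len(train_a), len(val_a), len(test_a))  # 50 25 25
```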