Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning
Authors: David Yunis, Justin Jung, Falcon Dai, Matthew Walter
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a novel method to extract skills from demonstrations for use in sparse-reward RL... We show strong performance in a variety of tasks... Figure 1: A sample of some skills that our method identifies for the (a) Ant Maze and (b) Kitchen environments... Section 4 (Experiments): In the following sections, we demonstrate the empirical performance of our proposed method... |
| Researcher Affiliation | Collaboration | David Yunis (TTI-Chicago, Chicago, IL, dyunis@ttic.edu); Justin Jung (Springtail.ai, San Francisco, CA, justin@springtail.ai); Falcon Z. Dai (Symbolica AI, San Francisco, CA, falcon@symbolica.ai); Matthew R. Walter (TTI-Chicago, Chicago, IL, mwalter@ttic.edu) |
| Pseudocode | Yes | Algorithm 1: Skill-extraction with BPE (a hedged sketch of this procedure appears after the table) |
| Open Source Code | Yes | Our code is available at https://github.com/dyunis/subwords_as_skills. |
| Open Datasets | Yes | Tasks: We consider online RL on Ant Maze and Kitchen from D4RL [23], two very challenging sparse-reward state-based environments. We also consider Coin Run [17], a discrete-action platforming game. [23] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020. [17] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman. Quantifying generalization in reinforcement learning. In International Conference on Machine Learning, pages 1282-1289, 2019. |
| Dataset Splits | No | The paper does not explicitly specify percentages or counts for training, validation, or test data splits. It mentions using "demonstrations" for skill extraction and then training an RL agent in environments, but formal data partitioning for validation is not detailed. |
| Hardware Specification | Yes | Methods measured on the same Nvidia RTX 3090 GPU with an 8-core Intel Core i7-9700 CPU @ 3.00 GHz. All experiments were performed on an internal cluster with access to around 100 Nvidia 2080 Ti (or more capable) GPUs. |
| Software Dependencies | No | Code was implemented in Python using PyTorch [55] for deep learning, Stable Baselines 3 [63] for RL, and Weights & Biases [10] for logging. The paper mentions software names but does not provide specific version numbers for them. |
| Experiment Setup | Yes | For our RL agent, we use SAC-discrete [16]. Both critics as well as the policy are optimized with Adam [35] with a standard learning rate of 3e-4. Replay buffer size is set to the standard 1 million transitions. We update both critics and the policy every step of environment interaction and sample uniformly from the replay buffer to do so... We choose a target entropy dependent on the domain: 0.1 for Ant Mazes, 0 for Kitchen, and 0.5 for Coin Run... we found a large batch size crucial to good performance in Ant Maze, where we use a batch size of 4096. For other tasks, we use a batch size of 64. For Ant Mazes and Kitchen, we choose defaults of k = 2·d_act, L = 10, N_max = 10^6 and N_min = 16. For Coin Run there is no need for discretization, so we only choose N_max = 10^6, L = 10, and N_min = 16. (A hedged configuration sketch of these settings appears after the table.) |
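For the pseudocode row, the following is a minimal sketch of BPE-style skill extraction over discretized demonstration actions, in the spirit of Algorithm 1 (Skill-extraction with BPE). It assumes demonstrations have already been discretized (e.g. k-means cluster indices of continuous actions); the function name `extract_skills` and its arguments are illustrative, not the authors' implementation.

```python
from collections import Counter

def extract_skills(demos, n_merges=100, max_skill_len=10):
    """BPE-style skill extraction over discretized demonstration actions.

    demos: list of sequences of discrete action indices (e.g. k-means
    cluster ids). Each token is kept as a tuple of primitive actions, so a
    merged token directly encodes an open-loop skill.
    """
    # Start from single-action tokens.
    seqs = [[(a,) for a in d] for d in demos]

    for _ in range(n_merges):
        # Count adjacent token pairs whose merge would not exceed the
        # maximum skill length L.
        pairs = Counter(
            (s[i], s[i + 1])
            for s in seqs
            for i in range(len(s) - 1)
            if len(s[i]) + len(s[i + 1]) <= max_skill_len
        )
        if not pairs:
            break
        (left, right), _ = pairs.most_common(1)[0]
        merged = left + right

        # Replace every occurrence of the chosen pair with the merged token.
        new_seqs = []
        for s in seqs:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and s[i] == left and s[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs

    # Multi-action tokens form the skill vocabulary for the downstream agent.
    return sorted({t for s in seqs for t in s if len(t) > 1})
```

With the paper's reported defaults this sketch would use max_skill_len = 10; the stopping rule here is simplified to a fixed merge count, whereas the paper constrains the vocabulary size between N_min = 16 and N_max = 10^6.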
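For the experiment-setup row, the reported hyperparameters can be collected into a plain configuration; the dictionary and key names below are hypothetical and not the schema of the released code.

```python
# Hypothetical configuration mirroring the reported hyperparameters.
RL_CONFIG = {
    "agent": "SAC-discrete",
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "replay_buffer_size": 1_000_000,
    "updates_per_env_step": 1,  # critics and policy updated every step
    "target_entropy": {"antmaze": 0.1, "kitchen": 0.0, "coinrun": 0.5},
    "batch_size": {"antmaze": 4096, "kitchen": 64, "coinrun": 64},
}

SKILL_CONFIG = {
    # k = 2 * d_act clusters for action discretization (continuous-action
    # domains only; Coin Run is already discrete).
    "n_action_clusters": lambda d_act: 2 * d_act,
    "max_skill_len": 10,      # L
    "max_vocab_size": 10**6,  # N_max
    "min_vocab_size": 16,     # N_min
}
```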