Robust Subtask Learning for Compositional Generalization
Authors: Kishor Jothimurugan, Steve Hsu, Osbert Bastani, Rajeev Alur
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two multi-task environments with continuous states and actions and demonstrate that our algorithms outperform state-of-the-art baselines. |
| Researcher Affiliation | Academia | 1University of Pennsylvania. Correspondence to: Kishor Jothimurugan <kishor@seas.upenn.edu>. |
| Pseudocode | Yes | Algorithm 1: Asynchronous value iteration algorithm for computing optimal subtask policies. Algorithm 2: Robust Option Soft Actor Critic. Algorithm 3: Asynchronous Robust Option SAC. (A generic value-iteration sketch appears after the table.) |
| Open Source Code | Yes | Our implementation is available online and can be found at https://github.com/keyshor/rosac. |
| Open Datasets | No | The paper mentions the "F1/10th environment" and cites "F110. F1/10 Autonomous Racing Competition. http://f1tenth.org", which is a simulator rather than a specific training dataset with access details. The "Rooms environment" appears to be custom-built, and no access is provided. |
| Dataset Splits | No | The paper describes evaluation against adversaries and mentions subtask sequences, but it does not specify explicit numerical training/validation/test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper states that "All experiments were run on a 48-core machine with 512GB of memory and 8 GPUs," but this general description does not specify the exact GPU or CPU models. |
| Software Dependencies | No | The paper mentions specific optimizers and algorithms (e.g., Adam optimizer, SAC, DDPG, PPO, REINFORCE) but does not provide specific version numbers for any of these software components or underlying libraries (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | The hidden dimension used is 64 for all approaches except MADDPG, for which we use 128-dimensional hidden layers. For DAGGER, NAIVE and AROSAC we run SAC with Adam optimizer (learning rate of α = 0.01), entropy weight β = 0.05, Polyak rate 0.005 and batch size of 100. In each iteration of AROSAC and DAGGER, SAC is run for N = 10000 steps. Similarly, ROSAC is run with Adam optimizer (learning rates αψ = αθ = 0.01), entropy weight β = 0.05, Polyak rate 0.005 and batch size of 300. The MADDPG baseline uses a learning rate of 0.0003 and batch size of 256. PAIRED uses PPO with a learning rate of 0.02, batch size of 512, minibatch size of 128 and 4 epochs for each policy update. The adversary is trained using REINFORCE with a learning rate of 0.003. (These settings are consolidated in the configuration sketch below.) |
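
The paper's pseudocode is not reproduced on this page. As a point of reference for Algorithm 1, the sketch below shows generic asynchronous (in-place) value iteration over a tabular model; it is not the paper's algorithm for subtask policies, and the function name, arguments, and tabular inputs are hypothetical. The asynchronous aspect is that each state's value is overwritten immediately, so later updates within the same sweep already see it.

```python
import numpy as np

def async_value_iteration(P, R, gamma=0.99, tol=1e-6, max_sweeps=1000):
    """Generic asynchronous (Gauss-Seidel) value iteration on a tabular MDP.

    P: transition probabilities of shape (S, A, S); R: expected rewards of
    shape (S, A). Both are hypothetical inputs, not artifacts of the paper.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(S):
            q = R[s] + gamma * P[s] @ V      # Q-values using the most recent V
            new_v = q.max()
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v                     # in-place update: later states reuse it
        if delta < tol:                      # stop once the largest change is small
            break
    # Greedy policy with respect to the converged value function.
    policy = (R + gamma * np.einsum('sat,t->sa', P, V)).argmax(axis=1)
    return V, policy
```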
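
For convenience, the hyperparameters quoted in the Experiment Setup row can be collected into configuration dictionaries, as in the sketch below. All names are illustrative and do not correspond to variables in the rosac repository; only the numeric values come from the paper.

```python
# Hypothetical consolidation of the reported hyperparameters (names are illustrative).
COMMON_SAC = {                      # DAGGER, NAIVE and AROSAC: SAC with Adam
    "learning_rate": 0.01,          # alpha
    "entropy_weight": 0.05,         # beta
    "polyak_rate": 0.005,
    "batch_size": 100,
    "hidden_dim": 64,
    "steps_per_iteration": 10_000,  # N, per iteration of AROSAC and DAGGER
}

ROSAC = {                           # Robust Option SAC
    "learning_rate_psi": 0.01,
    "learning_rate_theta": 0.01,
    "entropy_weight": 0.05,
    "polyak_rate": 0.005,
    "batch_size": 300,
    "hidden_dim": 64,
}

MADDPG = {"learning_rate": 0.0003, "batch_size": 256, "hidden_dim": 128}

PAIRED = {                          # PPO protagonist, REINFORCE adversary
    "policy_algo": "PPO",
    "learning_rate": 0.02,
    "batch_size": 512,
    "minibatch_size": 128,
    "epochs_per_update": 4,
    "adversary_algo": "REINFORCE",
    "adversary_learning_rate": 0.003,
}
```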