Robust Subtask Learning for Compositional Generalization

Authors: Kishor Jothimurugan, Steve Hsu, Osbert Bastani, Rajeev Alur

ICML 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two multi-task environments with continuous states and actions and demonstrate that our algorithms outperform state-of-the-art baselines. |
| Researcher Affiliation | Academia | University of Pennsylvania. Correspondence to: Kishor Jothimurugan <kishor@seas.upenn.edu>. |
| Pseudocode | Yes | Algorithm 1: Asynchronous value iteration algorithm for computing optimal subtask policies. Algorithm 2: Robust Option Soft Actor Critic. Algorithm 3: Asynchronous Robust Option SAC. (A generic value-iteration sketch is included after the table.) |
| Open Source Code | Yes | Our implementation is available online and can be found at https://github.com/keyshor/rosac. |
| Open Datasets | No | The paper mentions the "F1/10th environment" and cites "F110. F1/10 Autonomous Racing Competition. http://f1tenth.org", which is a simulator rather than a specific training dataset with access details. The "Rooms environment" appears to be custom-built, and no access details are provided. |
| Dataset Splits | No | The paper describes evaluation against adversaries and mentions subtask sequences, but it does not specify explicit numerical training/validation/test splits (e.g., percentages or sample counts). |
| Hardware Specification | No | All experiments were run on a 48-core machine with 512GB of memory and 8 GPUs. This is a general description; it does not specify the exact CPU or GPU models. |
| Software Dependencies | No | The paper names specific optimizers and algorithms (e.g., the Adam optimizer, SAC, DDPG, PPO, REINFORCE) but does not provide version numbers for these components or for underlying libraries (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | The hidden dimension used is 64 for all approaches except MADDPG, for which we use 128-dimensional hidden layers. For DAGGER, NAIVE and AROSAC we run SAC with the Adam optimizer (learning rate α = 0.01), entropy weight β = 0.05, Polyak rate 0.005 and batch size of 100. In each iteration of AROSAC and DAGGER, SAC is run for N = 10000 steps. Similarly, ROSAC is run with the Adam optimizer (learning rates αψ = αθ = 0.01), entropy weight β = 0.05, Polyak rate 0.005 and batch size of 300. The MADDPG baseline uses a learning rate of 0.0003 and batch size of 256. PAIRED uses PPO with a learning rate of 0.02, batch size of 512, minibatch size of 128 and 4 epochs for each policy update. The adversary is trained using REINFORCE with a learning rate of 0.003. (The reported values are collected into a config sketch after the table.) |
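The Pseudocode row above names an asynchronous value iteration algorithm for computing optimal subtask policies. The snippet below is a minimal, generic sketch of asynchronous (in-place, Gauss–Seidel) value iteration over a finite abstraction; it is not a reproduction of the paper's Algorithm 1, and the `states`, `actions`, `transition`, and `reward` inputs are hypothetical placeholders for illustration only.

```python
# Minimal generic sketch of asynchronous (in-place) value iteration.
# NOT the paper's Algorithm 1; states, transitions, and rewards are
# hypothetical placeholders used purely for illustration.
import random

def async_value_iteration(states, actions, transition, reward, gamma=0.99,
                          tol=1e-6, max_sweeps=1000):
    """In-place (Gauss-Seidel) value iteration over a finite state set.

    transition(s, a) -> list of (probability, next_state) pairs
    reward(s, a)     -> immediate reward (assumed bounded)
    """
    V = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        delta = 0.0
        # States are updated in a shuffled sweep; each update immediately
        # reuses the latest values, which makes the iteration asynchronous.
        order = list(states)
        random.shuffle(order)
        for s in order:
            best = max(
                reward(s, a) + gamma * sum(p * V[s2] for p, s2 in transition(s, a))
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy extraction from the converged value estimates.
    policy = {
        s: max(actions, key=lambda a: reward(s, a)
               + gamma * sum(p * V[s2] for p, s2 in transition(s, a)))
        for s in states
    }
    return V, policy
```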
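For quick reference, the hyperparameters quoted in the Experiment Setup row can be gathered into a single structure. This is only a transcription of the reported values into an illustrative Python dictionary; the key names and grouping are assumptions and do not correspond to the configuration format used in the authors' repository.

```python
# Hyperparameters as reported in the paper's experiment setup, transcribed
# into an illustrative dictionary. Key names and grouping are assumptions
# and do not reflect the config format of https://github.com/keyshor/rosac.
REPORTED_HYPERPARAMS = {
    "hidden_dim": 64,               # all approaches except MADDPG
    "hidden_dim_maddpg": 128,
    "sac": {                        # used by DAGGER, NAIVE and AROSAC
        "optimizer": "Adam",
        "learning_rate": 0.01,      # alpha
        "entropy_weight": 0.05,     # beta
        "polyak_rate": 0.005,
        "batch_size": 100,
        "steps_per_iteration": 10_000,  # N, per AROSAC/DAGGER iteration
    },
    "rosac": {
        "optimizer": "Adam",
        "learning_rate_psi": 0.01,
        "learning_rate_theta": 0.01,
        "entropy_weight": 0.05,
        "polyak_rate": 0.005,
        "batch_size": 300,
    },
    "maddpg": {"learning_rate": 0.0003, "batch_size": 256},
    "paired": {
        "algorithm": "PPO",
        "learning_rate": 0.02,
        "batch_size": 512,
        "minibatch_size": 128,
        "epochs_per_update": 4,
        "adversary": {"algorithm": "REINFORCE", "learning_rate": 0.003},
    },
}
```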