Know Your Action Set: Learning Action Relations for Reinforcement Learning

Authors: Ayush Jain, Norio Kosaka, Kyung-Min Kim, Joseph J Lim

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate AGILE on three varying action set scenarios requiring learning action interdependence: (i) shortcut actions in goal-reaching, which can shorten the optimal path to the goal in a 2D Grid World when available, (ii) co-dependent actions in tool reasoning, which require other tools to activate their functionality, and (iii) list-actions in simulated and real-data recommender systems where the cumulative list affects the user response. Figure 3 provides an overview of the tasks, the base and varying action space, and an illustration of the action interdependence. More environment details such as tasks, action representations, and data collection are present in Appendix A. (A sketch of the varying-action-set mechanic follows this table.)
Researcher Affiliation | Collaboration | Ayush Jain (1), Norio Kosaka (2), Kyung-Min Kim (2,4), Joseph J. Lim (3,4); 1: University of Southern California (USC), 2: NAVER CLOVA, 3: Korea Advanced Institute of Science and Technology (KAIST), 4: NAVER AI Lab
Pseudocode | Yes | Algorithm 1 Cascaded DQN: Listwise Action RL (A sketch of the cascaded listwise selection loop follows this table.)
Open Source Code | Yes | Code: https://github.com/clvrai/agile
Open Datasets | Yes | We use RecSim (Ie et al., 2019a) to simulate user interactions and extend it to the listwise recommendation task.
Dataset Splits | Yes | We split the general tool space into 1098 tools for training, 507 tools for validation, and 507 tools for testing. (A sketch of such a seed-controlled split follows this table.)
Hardware Specification | Yes | Each experiment seed takes about 4 hours for Grid Navigation, 60 hours for CREATE, 8 hours for RecSim, and 15 hours for Real-Data Recommender Systems to converge. We use the Weights & Biases tool (Biewald, 2020) for logging and tracking experiments. All the environments were developed using the OpenAI Gym interface (Brockman et al., 2016). For training Grid Navigation and CREATE environments, we use the PPO (Schulman et al., 2017) implementation based on Kostrikov (2018). For the recommender system environments, we use DQN (Mnih et al., 2015). We use the Adam optimizer (Kingma & Ba, 2014) throughout. We attach the code with details to reproduce all the experiments, except the real-data recommender system.
Software Dependencies | Yes | We used PyTorch (Paszke et al., 2019) for our implementation, and the experiments were primarily conducted on workstations with either NVIDIA GeForce RTX 2080 Ti, P40, or V100 GPUs on the NAVER Smart Machine Learning (NSML) platform (Kim et al., 2018).
Experiment Setup | Yes | The hyperparameters for the additional components introduced in AGILE, baselines, and ablations are shown in Table 3. The environment-specific and RL algorithm hyperparameters are described in Table 4.
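
The three scenarios in the Research Type row share one mechanic: the set of available actions changes across episodes, so the agent must reason about which actions it currently has rather than memorize a fixed action space. Below is a minimal, hypothetical Gym-style sketch of that mechanic for the shortcut-action grid world; the class and method names (VaryingActionGridWorld, _sample_action_set), the reward values, and the action deltas are illustrative assumptions, not taken from the AGILE code.

import numpy as np
import gym
from gym import spaces


class VaryingActionGridWorld(gym.Env):
    """Toy 2D grid world whose available action set is resampled every episode.

    Base actions are the 4 cardinal moves; "shortcut" actions jump two cells
    at once and may or may not be present in a given episode, so the optimal
    path to the goal depends on which actions the agent currently has.
    """

    # (dx, dy) for every action in the full action vocabulary
    ACTION_DELTAS = [(0, 1), (0, -1), (1, 0), (-1, 0),   # base moves
                     (0, 2), (0, -2), (2, 0), (-2, 0)]   # shortcut moves

    def __init__(self, size=8, n_shortcuts_per_episode=2, seed=0):
        self.size = size
        self.n_shortcuts = n_shortcuts_per_episode
        self.rng = np.random.default_rng(seed)
        # Observation = agent (x, y) plus a binary availability mask over all actions
        self.observation_space = spaces.Dict({
            "position": spaces.Box(0, size - 1, shape=(2,), dtype=np.int64),
            "action_mask": spaces.MultiBinary(len(self.ACTION_DELTAS)),
        })
        self.action_space = spaces.Discrete(len(self.ACTION_DELTAS))

    def _sample_action_set(self):
        """Base moves are always available; a random subset of shortcuts is added."""
        mask = np.zeros(len(self.ACTION_DELTAS), dtype=np.int8)
        mask[:4] = 1
        shortcut_ids = self.rng.choice(np.arange(4, 8), size=self.n_shortcuts, replace=False)
        mask[shortcut_ids] = 1
        return mask

    def reset(self):
        self.pos = np.array([0, 0])
        self.goal = np.array([self.size - 1, self.size - 1])
        self.action_mask = self._sample_action_set()
        return {"position": self.pos.copy(), "action_mask": self.action_mask.copy()}

    def step(self, action):
        assert self.action_mask[action], "agent chose an unavailable action"
        delta = np.array(self.ACTION_DELTAS[action])
        self.pos = np.clip(self.pos + delta, 0, self.size - 1)
        done = bool((self.pos == self.goal).all())
        reward = 1.0 if done else -0.01
        obs = {"position": self.pos.copy(), "action_mask": self.action_mask.copy()}
        return obs, reward, done, {}

The availability mask in the observation is the signal a relational policy such as AGILE would consume (through learned action representations) to decide whether a shortcut changes the optimal path.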
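
Algorithm 1 (Cascaded DQN) constructs the recommendation list one slot at a time, conditioning each slot's Q-values on the items already placed in the list. The sketch below is a hedged reading of that cascading selection loop, not the authors' implementation; the network layout, the running-mean summary of the partial list, and the names CascadeQNetwork and build_list are assumptions.

import torch
import torch.nn as nn


class CascadeQNetwork(nn.Module):
    """Q(state, partial-list summary, candidate) for one slot of the list (hypothetical layout)."""

    def __init__(self, state_dim, item_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + item_dim + item_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, partial_list_summary, candidates):
        # state: (B, state_dim), partial_list_summary: (B, item_dim),
        # candidates: (B, N, item_dim) -> Q-values of shape (B, N)
        B, N, _ = candidates.shape
        context = torch.cat([state, partial_list_summary], dim=-1)
        context = context.unsqueeze(1).expand(B, N, -1)
        q = self.net(torch.cat([context, candidates], dim=-1))
        return q.squeeze(-1)


@torch.no_grad()
def build_list(q_net, state, candidates, slate_size, epsilon=0.1):
    """Epsilon-greedy cascaded list construction over the currently available candidates."""
    B, N, item_dim = candidates.shape
    assert slate_size <= N
    chosen_idx = []
    summary = torch.zeros(B, item_dim)                    # summary of items picked so far
    available = torch.ones(B, N, dtype=torch.bool)
    for _ in range(slate_size):
        q = q_net(state, summary, candidates)
        q = q.masked_fill(~available, float("-inf"))      # never repeat an item
        greedy = q.argmax(dim=-1)
        random_pick = torch.multinomial(available.float(), 1).squeeze(-1)
        explore = torch.rand(B) < epsilon
        idx = torch.where(explore, random_pick, greedy)
        chosen_idx.append(idx)
        available[torch.arange(B), idx] = False
        # running mean of chosen item embeddings summarizes the partial list
        picked = candidates[torch.arange(B), idx]
        summary = summary + (picked - summary) / len(chosen_idx)
    return torch.stack(chosen_idx, dim=-1)                # (B, slate_size)

Per-slot TD targets and a target network would complete the DQN training loop; they are omitted here for brevity.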
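
The Dataset Splits row reports a 1098 / 507 / 507 partition of the CREATE tool space (2112 tools in total). A minimal sketch of producing that kind of disjoint, seed-controlled split is shown below; the integer tool ids and the seed are placeholders, not the splits shipped with the released code.

import numpy as np

def split_tools(tool_ids, n_train=1098, n_val=507, n_test=507, seed=0):
    """Shuffle once with a fixed seed, then carve out disjoint train/val/test tool sets."""
    assert len(tool_ids) >= n_train + n_val + n_test
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(tool_ids)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Example with placeholder integer tool ids (2112 = 1098 + 507 + 507).
train_tools, val_tools, test_tools = split_tools(np.arange(2112))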