Know Your Action Set: Learning Action Relations for Reinforcement Learning

Authors: Ayush Jain, Norio Kosaka, Kyung-Min Kim, Joseph J Lim

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate AGILE on three varying action set scenarios requiring learning action interdependence: (i) shortcut actions in goal-reaching, which can shorten the optimal path to the goal in a 2D Grid World when available, (ii) co-dependent actions in tool reasoning, which require other tools to activate their functionality, and (iii) list-actions in simulated and real-data recommender systems where the cumulative list affects the user response. Figure 3 provides an overview of the tasks, the base and varying action space, and an illustration of the action interdependence. More environment details such as tasks, action representations, and data collection are present in Appendix A. (A sketch of the varying-action-set mechanic follows this table.)
Researcher Affiliation | Collaboration | Ayush Jain (1), Norio Kosaka (2), Kyung-Min Kim (2,4), Joseph J. Lim (3,4); 1: University of Southern California (USC), 2: NAVER CLOVA, 3: Korea Advanced Institute of Science and Technology (KAIST), 4: NAVER AI Lab
Pseudocode | Yes | Algorithm 1 Cascaded DQN: Listwise Action RL (A sketch of the cascaded listwise selection loop follows this table.)
Open Source Code | Yes | Code: https://github.com/clvrai/agile
Open Datasets | Yes | We use RecSim (Ie et al., 2019a) to simulate user interactions and extend it to the listwise recommendation task.
Dataset Splits | Yes | We split the general tool space into 1098 tools for training, 507 tools for validation, and 507 tools for testing. (A sketch of such a seed-controlled split follows this table.)
Hardware Specification | Yes | Each experiment seed takes about 4 hours for Grid Navigation, 60 hours for CREATE, 8 hours for RecSim, and 15 hours for Real-Data Recommender Systems to converge. We use the Weights & Biases tool (Biewald, 2020) for logging and tracking experiments. All the environments were developed using the OpenAI Gym interface (Brockman et al., 2016). For training Grid Navigation and CREATE environments, we use the PPO (Schulman et al., 2017) implementation based on Kostrikov (2018). For the recommender system environments, we use DQN (Mnih et al., 2015). We use the Adam optimizer (Kingma & Ba, 2014) throughout. We attach the code with details to reproduce all the experiments, except the real-data recommender system.
Software Dependencies | Yes | We used PyTorch (Paszke et al., 2019) for our implementation, and the experiments were primarily conducted on workstations with either NVIDIA GeForce RTX 2080 Ti, P40, or V100 GPUs on the NAVER Smart Machine Learning (NSML) platform (Kim et al., 2018).
Experiment Setup | Yes | The hyperparameters for the additional components introduced in AGILE, baselines, and ablations are shown in Table 3. The environment-specific and RL algorithm hyperparameters are described in Table 4.
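
The three scenarios in the Research Type row share one mechanic: the set of available actions changes across episodes, so the agent must reason about which actions it currently has rather than memorize a fixed action space. Below is a minimal, hypothetical Gym-style sketch of that mechanic for the shortcut-action grid world; the class and method names (VaryingActionGridWorld, _sample_action_set), the reward values, and the action deltas are illustrative assumptions, not taken from the AGILE code.

import numpy as np
import gym
from gym import spaces


class VaryingActionGridWorld(gym.Env):
    """Toy 2D grid world whose available action set is resampled every episode.

    Base actions are the 4 cardinal moves; "shortcut" actions jump two cells
    at once and may or may not be present in a given episode, so the optimal
    path to the goal depends on which actions the agent currently has.
    """

    # (dx, dy) for every action in the full action vocabulary
    ACTION_DELTAS = [(0, 1), (0, -1), (1, 0), (-1, 0),   # base moves
                     (0, 2), (0, -2), (2, 0), (-2, 0)]   # shortcut moves

    def __init__(self, size=8, n_shortcuts_per_episode=2, seed=0):
        self.size = size
        self.n_shortcuts = n_shortcuts_per_episode
        self.rng = np.random.default_rng(seed)
        # Observation = agent (x, y) plus a binary availability mask over all actions
        self.observation_space = spaces.Dict({
            "position": spaces.Box(0, size - 1, shape=(2,), dtype=np.int64),
            "action_mask": spaces.MultiBinary(len(self.ACTION_DELTAS)),
        })
        self.action_space = spaces.Discrete(len(self.ACTION_DELTAS))

    def _sample_action_set(self):
        """Base moves are always available; a random subset of shortcuts is added."""
        mask = np.zeros(len(self.ACTION_DELTAS), dtype=np.int8)
        mask[:4] = 1
        shortcut_ids = self.rng.choice(np.arange(4, 8), size=self.n_shortcuts, replace=False)
        mask[shortcut_ids] = 1
        return mask

    def reset(self):
        self.pos = np.array([0, 0])
        self.goal = np.array([self.size - 1, self.size - 1])
        self.action_mask = self._sample_action_set()
        return {"position": self.pos.copy(), "action_mask": self.action_mask.copy()}

    def step(self, action):
        assert self.action_mask[action], "agent chose an unavailable action"
        delta = np.array(self.ACTION_DELTAS[action])
        self.pos = np.clip(self.pos + delta, 0, self.size - 1)
        done = bool((self.pos == self.goal).all())
        reward = 1.0 if done else -0.01
        obs = {"position": self.pos.copy(), "action_mask": self.action_mask.copy()}
        return obs, reward, done, {}

The availability mask in the observation is the signal a relational policy such as AGILE would consume (through learned action representations) to decide whether a shortcut changes the optimal path.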
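
Algorithm 1 (Cascaded DQN) constructs the recommendation list one slot at a time, conditioning each slot's Q-values on the items already placed in the list. The sketch below is a hedged reading of that cascading selection loop, not the authors' implementation; the network layout, the running-mean summary of the partial list, and the names CascadeQNetwork and build_list are assumptions.

import torch
import torch.nn as nn


class CascadeQNetwork(nn.Module):
    """Q(state, partial-list summary, candidate) for one slot of the list (hypothetical layout)."""

    def __init__(self, state_dim, item_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + item_dim + item_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, partial_list_summary, candidates):
        # state: (B, state_dim), partial_list_summary: (B, item_dim),
        # candidates: (B, N, item_dim) -> Q-values of shape (B, N)
        B, N, _ = candidates.shape
        context = torch.cat([state, partial_list_summary], dim=-1)
        context = context.unsqueeze(1).expand(B, N, -1)
        q = self.net(torch.cat([context, candidates], dim=-1))
        return q.squeeze(-1)


@torch.no_grad()
def build_list(q_net, state, candidates, slate_size, epsilon=0.1):
    """Epsilon-greedy cascaded list construction over the currently available candidates."""
    B, N, item_dim = candidates.shape
    assert slate_size <= N
    chosen_idx = []
    summary = torch.zeros(B, item_dim)                    # summary of items picked so far
    available = torch.ones(B, N, dtype=torch.bool)
    for _ in range(slate_size):
        q = q_net(state, summary, candidates)
        q = q.masked_fill(~available, float("-inf"))      # never repeat an item
        greedy = q.argmax(dim=-1)
        random_pick = torch.multinomial(available.float(), 1).squeeze(-1)
        explore = torch.rand(B) < epsilon
        idx = torch.where(explore, random_pick, greedy)
        chosen_idx.append(idx)
        available[torch.arange(B), idx] = False
        # running mean of chosen item embeddings summarizes the partial list
        picked = candidates[torch.arange(B), idx]
        summary = summary + (picked - summary) / len(chosen_idx)
    return torch.stack(chosen_idx, dim=-1)                # (B, slate_size)

Per-slot TD targets and a target network would complete the DQN training loop; they are omitted here for brevity.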
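
The Dataset Splits row reports a 1098 / 507 / 507 partition of the CREATE tool space (2112 tools in total). A minimal sketch of producing that kind of disjoint, seed-controlled split is shown below; the integer tool ids and the seed are placeholders, not the splits shipped with the released code.

import numpy as np

def split_tools(tool_ids, n_train=1098, n_val=507, n_test=507, seed=0):
    """Shuffle once with a fixed seed, then carve out disjoint train/val/test tool sets."""
    assert len(tool_ids) >= n_train + n_val + n_test
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(tool_ids)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Example with placeholder integer tool ids (2112 = 1098 + 507 + 507).
train_tools, val_tools, test_tools = split_tools(np.arange(2112))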