Generalisation in Lifelong Reinforcement Learning through Logical Composition

Authors: Geraud Nangue Tasse, Steven James, Benjamin Rosman

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We verify our approach in a series of experiments, where we perform transfer learning both after learning a set of base tasks, and after learning an arbitrary set of tasks. We also demonstrate that, as a side effect of our transfer learning approach, an agent can produce an interpretable Boolean expression of its understanding of the current task. Finally, we demonstrate our approach in the full lifelong setting where an agent receives tasks from an unknown distribution. Starting from scratch, an agent is able to quickly generalise over the task distribution after learning only a few tasks, which are sub-logarithmic in the size of the task space."
Researcher Affiliation | Academia | Geraud Nangue Tasse, Steven James & Benjamin Rosman, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa; geraudnt@gmail.com, {steven.james, benjamin.rosman1}@wits.ac.za
Pseudocode | Yes | "Algorithm 1 shows the full pseudo-code for SOPGOL."
Open Source Code | No | No explicit statement about the release of source code or a direct link to a code repository for the described methodology was found.
Open Datasets | No | The paper mentions the PICKUPOBJ domain from the MINIGRID environment (Chevalier-Boisvert et al., 2018) and the Four Rooms domain (Sutton et al., 1999), but does not provide specific access information (links, DOIs, or explicit statements of public availability) for the exact tasks used beyond citing the environments.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It mentions using 'test tasks' and 'pretraining', but no specific split percentages or sample counts are given.
Hardware Specification | No | The paper acknowledges computing resources from the Centre for High Performance Computing (CHPC) and the Mathematical Sciences Support unit at the University of the Witwatersrand, but does not provide specific hardware details such as CPU/GPU models or memory.
Software Dependencies | No | The paper mentions using deep Q-learning (Mnih et al., 2015) and Q-learning (Watkins, 1989) as RL methods, as well as the 'gym-minigrid' environment, but it does not provide specific version numbers for these software components or for any other libraries or frameworks.
Experiment Setup | Yes | "We used the ADAM optimiser with batch size 256 and a learning rate of 10⁻³. We started training after 1000 steps of random exploration and updated the target Q-network every 1000 steps. Finally, we used ϵ-greedy exploration, annealing ϵ from 0.5 to 0.05 over 100000 timesteps."
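
For concreteness, the hyperparameters quoted above can be collected into a single configuration. The sketch below is illustrative only: names such as DQNConfig and epsilon_at are our own, not from the paper, and the linear ϵ-annealing schedule is an assumption, since the quoted setup states only the start value, end value, and annealing horizon.

```python
# Minimal sketch of the reported DQN training configuration (Adam, batch size 256,
# learning rate 1e-3, 1000 warm-up steps of random exploration, target-network sync
# every 1000 steps, epsilon annealed from 0.5 to 0.05 over 100,000 timesteps).
# The linear annealing schedule is an assumption; the paper does not specify the shape.
from dataclasses import dataclass


@dataclass
class DQNConfig:
    batch_size: int = 256
    learning_rate: float = 1e-3
    warmup_steps: int = 1_000          # random exploration before updates begin
    target_update_every: int = 1_000   # steps between target Q-network syncs
    epsilon_start: float = 0.5
    epsilon_end: float = 0.05
    epsilon_anneal_steps: int = 100_000


def epsilon_at(step: int, cfg: DQNConfig) -> float:
    """Return the exploration rate at a given timestep (assumed linear annealing)."""
    frac = min(step / cfg.epsilon_anneal_steps, 1.0)
    return cfg.epsilon_start + frac * (cfg.epsilon_end - cfg.epsilon_start)


if __name__ == "__main__":
    cfg = DQNConfig()
    for step in (0, 50_000, 100_000, 200_000):
        print(f"step {step:>7}: epsilon = {epsilon_at(step, cfg):.3f}")
```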