Generalisation in Lifelong Reinforcement Learning through Logical Composition
Authors: Geraud Nangue Tasse, Steven James, Benjamin Rosman
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify our approach in a series of experiments, where we perform transfer learning both after learning a set of base tasks, and after learning an arbitrary set of tasks. We also demonstrate that, as a side effect of our transfer learning approach, an agent can produce an interpretable Boolean expression of its understanding of the current task. Finally, we demonstrate our approach in the full lifelong setting where an agent receives tasks from an unknown distribution. Starting from scratch, an agent is able to quickly generalise over the task distribution after learning only a few tasks, which are sub-logarithmic in the size of the task space. |
| Researcher Affiliation | Academia | Geraud Nangue Tasse, Steven James & Benjamin Rosman School of Computer Science and Applied Mathematics University of the Witwatersrand Johannesburg, South Africa geraudnt@gmail.com, {steven.james, benjamin.rosman1}@wits.ac.za |
| Pseudocode | Yes | Algorithm 1 shows the full pseudo-code for SOPGOL. |
| Open Source Code | No | No explicit statement about the release of source code or a direct link to a code repository for the described methodology was found. |
| Open Datasets | No | The paper mentions the PICKUPOBJ domain from the MINIGRID environment (Chevalier-Boisvert et al., 2018) and Four Rooms domain (Sutton et al., 1999), but does not provide specific access information (links, DOIs, or explicit statements of public availability) for the exact datasets used beyond citing the environments. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It mentions using 'test tasks' and 'pretraining' but no specific split percentages or sample counts. |
| Hardware Specification | No | The paper acknowledges computing resources from the Centre for High Performance Computing (CHPC) and the Mathematical Sciences Support unit at the University of the Witwatersrand, but does not provide specific hardware details such as CPU/GPU models or memory. |
| Software Dependencies | No | The paper mentions using deep Q-learning (Mnih et al., 2015) and Q-learning (Watkins, 1989) as RL methods. It also mentions 'gym-minigrid' environment. However, it does not provide specific version numbers for these software components or any other libraries/frameworks. |
| Experiment Setup | Yes | We used the ADAM optimiser with batch size 256 and a learning rate of 10⁻³. We started training after 1000 steps of random exploration and updated the target Q-network every 1000 steps. Finally, we used ϵ-greedy exploration, annealing ϵ from 0.5 to 0.05 over 100000 timesteps. |
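
For readability, the hyperparameters quoted in the Experiment Setup row can be collected into a short configuration sketch. This is a hypothetical reconstruction, not the authors' released code: the `DQNConfig` and `epsilon_at` names are illustrative, and the linear annealing schedule is an assumption, since the paper states only the ϵ endpoints and the annealing horizon.

```python
# Hypothetical reconstruction of the reported DQN training configuration.
# Field names are illustrative; only the values come from the paper's Experiment Setup.
from dataclasses import dataclass


@dataclass
class DQNConfig:
    optimizer: str = "adam"           # ADAM optimiser (as reported)
    batch_size: int = 256             # batch size 256
    learning_rate: float = 1e-3       # learning rate 10^-3
    warmup_steps: int = 1_000         # random exploration before training starts
    target_update_every: int = 1_000  # target Q-network update interval
    eps_start: float = 0.5            # epsilon-greedy: initial epsilon
    eps_end: float = 0.05             # epsilon-greedy: final epsilon
    eps_anneal_steps: int = 100_000   # annealing horizon in timesteps


def epsilon_at(step: int, cfg: DQNConfig) -> float:
    """Anneal epsilon from eps_start to eps_end over eps_anneal_steps.

    A linear schedule is assumed here; the paper does not specify the shape.
    """
    frac = min(step / cfg.eps_anneal_steps, 1.0)
    return cfg.eps_start + frac * (cfg.eps_end - cfg.eps_start)


if __name__ == "__main__":
    cfg = DQNConfig()
    print(epsilon_at(0, cfg), epsilon_at(50_000, cfg), epsilon_at(200_000, cfg))
    # 0.5  0.275  0.05
```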