IR-VIC: Unsupervised Discovery of Sub-goals for Transfer in RL
Authors: Nirbhay Modhe, Prithvijit Chattopadhyay, Mohit Sharma, Abhishek Das, Devi Parikh, Dhruv Batra, Ramakrishna Vedantam
IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 3 Experiments. Environments. We pre-train and test on grid-worlds from the MiniGrid [Chevalier-Boisvert et al., 2018] environments. We first consider a set of simple environments, 4-Room and Maze (see Fig. 3), followed by MultiRoomNXSY, also used by [Goyal et al., 2019]. The MultiRoomNXSY environments consist of X rooms of size Y, connected in random orientations. We refer to the ordering of rooms, doors and goal as a layout in the MultiRoomNXSY environment; pre-training of options (for IR-VIC and DIAYN) is performed on a single fixed layout while transfer is performed on several different layouts (a layout is randomly selected from a set every time the environment is reset). In all pre-training environments, we fix the option trajectory length H (the number of steps an option takes before termination) to 30 steps. We use Advantage Actor-Critic (A2C) for all experiments. Since code for InfoBot [Goyal et al., 2019] was not public, we report numbers based on a re-implementation of InfoBot, ensuring consistency with their architectural and hyperparameter choices. We refer the readers to our code for further details. Baselines. We evaluate the following on quality of exploration and transfer to downstream goal-driven tasks with sparse rewards: 1) InfoBot (our implementation), which identifies goal-driven decision states by regularizing goal information; 2) DIAYN, whose focus is unsupervised skill acquisition but which has an I(A_t; Ω \| S_t) term that can be used for the bonus in Equation 8; 3) count-based exploration, which uses visitation counts as an exploration incentive (this corresponds to replacing I(Ω; Z_t \| S_t, S_0) with 1 in Equation 8); 4) a randomly initialized encoder p(z_t \| ω, x_t), which is a noisy version of the count-based baseline where the scale of the reward is adjusted to match the count-based baseline; 5) how different values of β affect performance and how we choose a β value using a validation set; and 6) a heuristic baseline that uses domain knowledge to identify landmarks such as corners and doorways and provides a higher count-based exploration bonus to these states. This validates the extent to which necessary option information is useful in identifying a sparse set of states that are useful for transfer vs. heuristically determined landmarks. (A minimal sketch of the count-based bonus baseline follows the table.) |
| Researcher Affiliation | Collaboration | Nirbhay Modhe¹, Prithvijit Chattopadhyay¹, Mohit Sharma¹, Abhishek Das¹, Devi Parikh¹,², Dhruv Batra¹,² and Ramakrishna Vedantam². ¹Georgia Institute of Technology, ²Facebook AI Research. {nirbhaym,prithvijit3,mohit.sharma,abhshkdz,parikh,dbatra}@gatech.edu, ramav@fb.com |
| Pseudocode | No | The paper describes the algorithmic steps in text but does not include a formal pseudocode block or algorithm listing. |
| Open Source Code | Yes | We refer the readers to our code² for further details. ² https://github.com/nirbhayjm/irvic |
| Open Datasets | Yes | Environments. We pre-train and test on grid-worlds from the MiniGrid [Chevalier-Boisvert et al., 2018] environments. ... [Chevalier-Boisvert et al., 2018] Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018. |
| Dataset Splits | Yes | We pick the best model for transfer based on performance on the validation environments, and study generalization to novel test environments. Choosing the value of β in this setting is thus akin to model selection. Such design choices are, in general, inherent in unsupervised representation learning (e.g., with K-means and β-VAE [Higgins et al., 2017]). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used (e.g., GPU/CPU models, memory). |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | In all pre-training environments, we fix the option trajectory length H (the number of steps an option takes before termination) to 30 steps. We use Advantage Actor-Critic (A2C) for all experiments. ... We fix the coefficient for maximum-entropy, α = 10^-3, which consistently works well for our approach as well as baselines. ... We sweep over β in log-scale from {10^-1, ..., 10^-6}... |
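
To make the quoted setup concrete, below is a minimal sketch of the count-based exploration baseline on a MiniGrid MultiRoom environment, i.e. the baseline that replaces I(Ω; Z_t \| S_t, S_0) with 1 in Equation 8. This is not the authors' code: the environment ID comes from the public gym-minigrid package, and the bonus scale and 1/√N(s) form are assumptions, since the paper only states that visitation counts are used as the exploration incentive.

```python
from collections import defaultdict

import gym
import gym_minigrid  # noqa: F401 -- importing registers the MiniGrid-* environments

# MultiRoom with X=4 rooms of size Y=5; the exact ID is taken from gym-minigrid's registry.
env = gym.make("MiniGrid-MultiRoom-N4-S5-v0")

visit_counts = defaultdict(int)

def count_bonus(state_key, scale=1.0):
    """Count-based exploration bonus: scale / sqrt(N(s)), where N(s) is the visit count."""
    visit_counts[state_key] += 1
    return scale / (visit_counts[state_key] ** 0.5)

obs = env.reset()                                  # classic Gym API (gym < 0.26)
done = False
while not done:
    action = env.action_space.sample()             # stand-in for the A2C policy
    obs, extrinsic_reward, done, info = env.step(action)
    bonus = count_bonus(tuple(env.agent_pos))      # agent's (x, y) grid cell as the state key
    shaped_reward = extrinsic_reward + bonus       # reward actually fed to the learner
```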
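The β sweep and validation-based model selection quoted in the Dataset Splits and Experiment Setup rows can be summarized by the following sketch; `pretrain_options` and `validation_return` are hypothetical placeholders, not functions from the released repository.

```python
def select_beta(pretrain_options, validation_return):
    """Sweep β in log scale and keep the model that does best on the validation environments."""
    betas = [10.0 ** -k for k in range(1, 7)]       # {1e-1, ..., 1e-6}
    scores = {}
    for beta in betas:
        # α = 1e-3 (max-entropy coefficient) and H = 30 (option horizon) as quoted above.
        model = pretrain_options(beta=beta, alpha=1e-3, option_horizon=30)
        scores[beta] = validation_return(model)     # mean return on the validation layouts
    best_beta = max(scores, key=scores.get)
    return best_beta, scores
```

The selected model is then evaluated once on the novel test environments, which is why the paper describes choosing β as a form of model selection.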