Zero-Shot Transfer with Deictic Object-Oriented Representation in Reinforcement Learning

Authors: Ofir Marom, Benjamin Rosman

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted two sets of experiments on this domain. In the first set we have one destination and we fix the number of passengers, n. We generate a grounded MDP with an initial state by randomly sampling n passenger locations and one destination location from one of six pre-specified locations and we also sample a random taxi start location together with one of four wall configurations as shown in Figure 1a. We apply 20 independent runs of the following procedure: we sample 10 test MDPs with random initial states. We then randomly sample a training MDP and run DOORMAXD on it for one episode until we reach the terminal state.
Researcher Affiliation | Academia | (1) University of the Witwatersrand, Johannesburg, South Africa; (2) Council for Scientific and Industrial Research, Pretoria, South Africa
Pseudocode | Yes | Algorithm 1: DOORMAXD: learning procedure for C.α and a.
Open Source Code | No | The paper does not contain any explicit statements or links indicating the availability of open-source code for the described methodology.
Open Datasets | No | The paper describes using the 'all-passenger any-destination Taxi domain' and the 'Sokoban domain', and generating instances from these. However, it does not provide concrete access information (link, DOI, specific repository, or formal citation with authors/year for specific dataset instances) to make these generated datasets publicly available or reproducible by others without re-implementing the generation process.
Dataset Splits | No | The paper mentions sampling '10 test MDPs' and 'a training MDP' but does not specify a separate validation split, nor does it provide percentages or exact counts for any validation set.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, memory specifications).
Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We apply 20 independent runs of the following procedure: we sample 10 test MDPs with random initial states. We then randomly sample a training MDP and run DOORMAXD on it for one episode until we reach the terminal state. Upon termination, we test performance by running DOORMAXD for one episode on each of the 10 test MDPs, stopping an episode early if we exceed 500 steps. We repeat this for 100 training MDPs. Since all the MDPs come from the same schema we can share transition dynamics between our MDPs but we only update the transition dynamics on training MDPs. In our experiments we start with n = 1 passenger and increase to n = 4 passengers. We run our experiments for Propositional OO-MDPs and two versions of Deictic OO-MDPs.
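
Since no code is released, the experiment setup quoted above can be sketched roughly as follows. This is a minimal illustration of the protocol only, not the authors' implementation. The names `TaxiInstance`, `sample_instance`, `run_protocol`, and `StubLearner` are hypothetical, as is the learner interface (`train_episode`, `test_episode`). The 5x5 grid size, sampling locations with replacement, and resetting the learner at the start of each independent run are assumptions that the quoted text does not confirm. `StubLearner` is a placeholder that does not implement DOORMAXD.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

NUM_LOCATIONS = 6      # six pre-specified passenger/destination locations
NUM_WALL_CONFIGS = 4   # four wall configurations (Figure 1a)
MAX_TEST_STEPS = 500   # test episodes are stopped early beyond this many steps


@dataclass(frozen=True)
class TaxiInstance:
    """Initial state of one grounded Taxi MDP (hypothetical container)."""
    passengers: Tuple[int, ...]   # indices into the pre-specified locations
    destination: int              # index into the pre-specified locations
    taxi_start: Tuple[int, int]   # (row, col) on the grid
    wall_config: int              # index of the wall configuration


def sample_instance(n_passengers: int, rng: random.Random,
                    grid_size: int = 5) -> TaxiInstance:
    # Assumption: locations are drawn with replacement and the grid is
    # grid_size x grid_size; neither detail is stated in the quoted text.
    return TaxiInstance(
        passengers=tuple(rng.randrange(NUM_LOCATIONS) for _ in range(n_passengers)),
        destination=rng.randrange(NUM_LOCATIONS),
        taxi_start=(rng.randrange(grid_size), rng.randrange(grid_size)),
        wall_config=rng.randrange(NUM_WALL_CONFIGS),
    )


class StubLearner:
    """Placeholder learner, NOT DOORMAXD: it only counts training episodes."""

    def __init__(self) -> None:
        self.episodes_seen = 0

    def train_episode(self, instance: TaxiInstance) -> None:
        # A real learner would update its transition model here.
        self.episodes_seen += 1

    def test_episode(self, instance: TaxiInstance, max_steps: int) -> int:
        # A real learner would act without updating its transition model.
        return max_steps if self.episodes_seen < 2 else 20


def run_protocol(n_passengers: int, make_learner: Callable[[], StubLearner],
                 n_runs: int = 20, n_test: int = 10, n_train: int = 100,
                 seed: int = 0) -> List[List[float]]:
    """Return, per run, the mean test steps after each training MDP."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_runs):
        learner = make_learner()  # assumption: a fresh model for every run
        test_set = [sample_instance(n_passengers, rng) for _ in range(n_test)]
        curve = []
        for _ in range(n_train):
            learner.train_episode(sample_instance(n_passengers, rng))
            steps = [min(learner.test_episode(m, MAX_TEST_STEPS), MAX_TEST_STEPS)
                     for m in test_set]
            curve.append(sum(steps) / n_test)
        curves.append(curve)
    return curves
```

Calling `run_protocol(n, make_learner)` for n = 1 through n = 4 would mirror the passenger sweep described above. Averaging the returned curves across runs gives one learning curve per representation.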