Learning Generalizable Visual Representations via Interactive Gameplay

Authors: Luca Weihs, Aniruddha Kembhavi, Kiana Ehsani, Sarah M Pratt, Winson Han, Alvaro Herrasti, Eric Kolve, Dustin Schwenk, Roozbeh Mottaghi, Ali Farhadi

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our first set of experiments shows that our agents develop a low-level visual understanding of individual images, measured by their capacity to perform a collection of standard tasks from the computer vision literature; these tasks include pixel-to-pixel depth (Saxena et al., 2006) and surface normal (Fouhey et al., 2013) prediction from a single image. Our experiments are designed to address three questions: (1) has our cache agent learned to proficiently hide and seek objects? (2) how do the SIRs learned by playing cache compare to those learned using standard supervised approaches and when training on other interactive tasks? and (3) has the cache agent learned to integrate observations over time to produce general DIR representations?
Researcher Affiliation | Collaboration | (1) Allen Institute for Artificial Intelligence, (2) University of Washington
Pseudocode | Yes | Algorithm 1: Exploration and mapping episode reward structure; Algorithm 2: Object hiding episode reward structure; Algorithm 3: Object manipulation episode reward structure; Algorithm 4: Seeking episode reward structure
Open Source Code | No | No concrete access to source code is provided. The paper states, 'We direct anyone interested in exact reproduction of our VDR procedure to our code base.' and 'To fully reproduce our results please see our code base.', but no link or repository name is given for public access.
Open Datasets | Yes | For this we leverage AI2-THOR (Kolve et al., 2017), a near photo-realistic, interactive, simulated 3D environment of indoor living spaces (see Fig. 1a). We compare against a fully supervised model trained on ImageNet (Deng et al., 2009) for SUN scene classification (Xiao et al., 2010), as well as the NYU V2 depth (Silberman et al., 2012) and walkability (Mottaghi et al., 2016) tasks. (An illustrative AI2-THOR sketch follows the table.)
Dataset Splits | Yes | Excluding foyers, which are reserved for our dynamic image representation experiments and used nowhere else, we consider the first 20 scenes of each scene type to be train scenes, the next five of each type to be validation scenes, and the last five of each type to be test scenes. (A sketch of this split follows the table.)
Hardware Specification | No | No specific hardware models (e.g., GPU/CPU models) were mentioned. The paper only states: 'We train our cache agent using eight GPUs with one GPU reserved for running AI2-THOR processes, one reserved for VDR, and the other six dedicated to training with reinforcement and self-supervised learning.'
Software Dependencies | Yes | We use the ADAM optimizer (Kingma & Ba, 2015) with AMSGrad (Reddi et al., 2018), moving average parameters β1 = 0.99, β2 = 0.999, a learning rate of 10^-3 for VDR, and varying learning rates for the different cache stages (10^-4 for the E&M, OH, OM, and S stages, 5 × 10^-4 for the PS-stage). We run our analysis using the R (R Core Team, 2019) programming language; in particular, we use the glmmPQL function in the nlme (Pinheiro et al., 2019) package to fit our GLMM models and then the emmeans (Lenth, 2019) package to obtain p-values of contrasts between fixed effects. (An optimizer sketch follows the table.)
Experiment Setup | Yes | For the A3C loss we let the discounting parameter γ = 0.99 (except in the OH-stage, where γ = 0.8), the entropy weight β = 0.01, and the GAE parameter τ = 1. We use the ADAM optimizer (Kingma & Ba, 2015) with AMSGrad (Reddi et al., 2018), moving average parameters β1 = 0.99, β2 = 0.999, a learning rate of 10^-3 for VDR, and varying learning rates for the different cache stages (10^-4 for the E&M, OH, OM, and S stages, 5 × 10^-4 for the PS-stage). (A GAE sketch follows the table.)
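
The Open Datasets row references AI2-THOR as the interactive environment. As a hedged illustration of how such an environment is driven, here is a minimal sketch using the public ai2thor Python package; the scene name and action string are standard ai2thor identifiers, but the controller configuration the authors actually used is not specified in the quoted text and is an assumption here.

    # Minimal AI2-THOR interaction sketch (assumes `pip install ai2thor`).
    # The paper's exact controller settings are unknown; these are defaults.
    from ai2thor.controller import Controller

    controller = Controller(scene="FloorPlan1")  # one of THOR's kitchen scenes
    event = controller.step(action="MoveAhead")  # advance the agent one step
    print(event.metadata["agent"]["position"])   # agent's position after the step
    controller.stop()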
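The Dataset Splits row pins down a 20/5/5 per-scene-type split. The sketch below writes that rule out explicitly; the per-type base indices follow AI2-THOR's standard FloorPlan numbering (kitchens from 1, living rooms from 201, bedrooms from 301, bathrooms from 401), and mapping the quoted "first 20 / next five / last five" ordering onto that numbering is an assumption. The excluded foyers are not part of this numbering.

    # Sketch of the paper's 20/5/5 per-scene-type split (assumed to follow
    # AI2-THOR's standard FloorPlan numbering; foyers are excluded entirely).
    TYPE_BASES = {"kitchen": 1, "living_room": 201, "bedroom": 301, "bathroom": 401}

    def scene_splits(base):
        scenes = [f"FloorPlan{base + i}" for i in range(30)]
        return {"train": scenes[:20], "valid": scenes[20:25], "test": scenes[25:]}

    splits = {t: scene_splits(b) for t, b in TYPE_BASES.items()}
    assert len(splits["kitchen"]["train"]) == 20  # e.g. FloorPlan1..FloorPlan20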
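The optimizer settings quoted in the Software Dependencies and Experiment Setup rows map directly onto a standard PyTorch configuration. This is a sketch of those hyperparameters, not the authors' code; the placeholder network and variable names are invented for illustration.

    # ADAM with AMSGrad and the quoted hyperparameters (illustrative only).
    import torch

    model = torch.nn.Linear(4, 4)  # placeholder for the actual network
    vdr_optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-3, betas=(0.99, 0.999), amsgrad=True
    )
    # Per-stage learning rates for the cache stages, as quoted:
    stage_lrs = {"E&M": 1e-4, "OH": 1e-4, "OM": 1e-4, "S": 1e-4, "PS": 5e-4}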
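Finally, the Experiment Setup row quotes a discount γ = 0.99 (0.8 in the OH stage) and a GAE parameter τ = 1. For reference, here is a textbook generalized advantage estimation computation (Schulman et al., 2016) using those constants; it is not extracted from the paper's implementation.

    # Textbook GAE computation illustrating the quoted gamma and tau values.
    def gae_advantages(rewards, values, gamma=0.99, tau=1.0):
        """rewards has length T; values has length T + 1 (last is a bootstrap)."""
        advantages, gae = [], 0.0
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            gae = delta + gamma * tau * gae
            advantages.append(gae)
        return advantages[::-1]

    adv = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.1, 0.2, 0.5, 0.0])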