See, Hear, Explore: Curiosity via Audio-Visual Association

Authors: Victoria Dean, Shubham Tulsiani, Abhinav Gupta

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present results on several Atari environments and Habitat (a photorealistic navigation simulator), showing the benefits of using an audio-visual association model for intrinsically guiding learning agents in the absence of external rewards. |
| Researcher Affiliation | Collaboration | Victoria Dean (Carnegie Mellon University, vdean@cs.cmu.edu); Shubham Tulsiani (Facebook AI Research, shubhtuls@fb.com); Abhinav Gupta (Carnegie Mellon University and Facebook AI Research, abhinavg@cs.cmu.edu) |
| Pseudocode | No | The paper describes the methodology in text and with diagrams (e.g., Figure 2) but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | For videos and code, see https://vdean.github.io/audio-curiosity.html. |
| Open Datasets | No | The paper uses Atari games and the Habitat simulator as training environments. While these environments are generally accessible, the paper does not provide links, DOIs, or formal citations for a static dataset with explicit public-access information. |
| Dataset Splits | No | The paper describes training agents within the Atari and Habitat environments for a specified number of frames but does not mention explicit training, validation, or test splits of a static dataset. |
| Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU or CPU models or memory. |
| Software Dependencies | No | The paper mentions software such as Gym Retro and Habitat and algorithms such as PPO, but does not provide version numbers for these or other software components used in the experiments. |
| Experiment Setup | Yes | During training, the agent policy is rolled out in parallel environments. These yield trajectories which are each chunked into 128 time steps. [...] To compute audio features, we take an audio clip spanning 4 time steps (1/15th of a second for these 60 frame per second environments) and apply a Fast Fourier Transform (FFT). The FFT output is downsampled using max pooling to a 512-dimensional feature vector, which is used as input to the discriminator along with a 512-dimensional visual feature vector. We trained our approach and baselines for 200 million frames using the intrinsic reward and measure performance by the extrinsic reward throughout learning. |
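
For readers who want a concrete picture of the setup quoted in the last row, the Python sketch below illustrates one way the audio-visual curiosity signal could be computed: the FFT of a short audio clip is max-pooled to a 512-dimensional feature and paired with a visual feature in an association discriminator whose error serves as the intrinsic reward. The sample rate, network sizes, and the exact reward formula (cross-entropy of the true-pair label) are assumptions of this sketch, not details confirmed by the paper.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


def audio_feature(clip, feature_dim=512):
    """Max-pooled FFT magnitude feature for one audio clip.

    `clip` is a 1-D waveform covering 4 game time steps (~1/15 s);
    the sample rate (and hence clip length) is an assumption here and
    must yield at least `feature_dim` spectral bins.
    """
    spectrum = np.abs(np.fft.rfft(clip))
    pool = len(spectrum) // feature_dim  # spectral bins per pooling window
    pooled = spectrum[: pool * feature_dim].reshape(feature_dim, pool).max(axis=1)
    return torch.from_numpy(pooled).float()


class AVDiscriminator(nn.Module):
    """Scores whether a (visual, audio) feature pair is truly associated.

    Layer sizes are illustrative, not taken from the paper.
    """

    def __init__(self, feat_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, visual_feat, audio_feat):
        return self.net(torch.cat([visual_feat, audio_feat], dim=-1))


def intrinsic_reward(disc, visual_feat, audio_feat):
    """Curiosity reward: the discriminator's error on the true pairing.

    Unfamiliar audio-visual associations are misclassified more often and
    therefore yield a larger reward. Using the cross-entropy of the
    'associated' label is one plausible formulation, assumed here.
    """
    with torch.no_grad():
        logit = disc(visual_feat, audio_feat)
        return F.binary_cross_entropy_with_logits(
            logit, torch.ones_like(logit), reduction="none"
        ).squeeze(-1)
```

In a full training loop, the discriminator would presumably also be trained against mismatched (visual, audio) pairs as negatives, and the per-step reward above would stand in for the extrinsic reward fed to PPO; those pieces are omitted here for brevity.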