See, Hear, Explore: Curiosity via Audio-Visual Association
Authors: Victoria Dean, Shubham Tulsiani, Abhinav Gupta
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present results on several Atari environments and Habitat (a photorealistic navigation simulator), showing the benefits of using an audio-visual association model for intrinsically guiding learning agents in the absence of external rewards. |
| Researcher Affiliation | Collaboration | Victoria Dean (Carnegie Mellon University, vdean@cs.cmu.edu); Shubham Tulsiani (Facebook AI Research, shubhtuls@fb.com); Abhinav Gupta (Carnegie Mellon University and Facebook AI Research, abhinavg@cs.cmu.edu) |
| Pseudocode | No | The paper describes the methodology in text and with diagrams (e.g., Figure 2) but does not include formal pseudocode or algorithm blocks (a hedged sketch of the described mechanism appears after this table). |
| Open Source Code | Yes | For videos and code, see https://vdean.github.io/audio-curiosity.html. |
| Open Datasets | No | The paper uses Atari games and the Habitat simulator as training environments. While these environments are generally accessible, the paper does not provide links, DOIs, or formal citations for a dataset in the sense of a static collection of data with explicit public-access information. |
| Dataset Splits | No | The paper describes training agents within environments (Atari and Habitat) for a specified number of frames but does not mention explicit training, validation, or test dataset splits in the context of a static dataset. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions software like Gym Retro and Habitat, and algorithms like PPO, but does not provide specific version numbers for these or other ancillary software components used in the experiments. |
| Experiment Setup | Yes | During training, the agent policy is rolled out in parallel environments. These yield trajectories which are each chunked into 128 time steps. [...] To compute audio features, we take an audio clip spanning 4 time steps (1/15th of a second for these 60 frame per second environments) and apply a Fast Fourier Transform (FFT). The FFT output is downsampled using max pooling to a 512-dimensional feature vector, which is used as input to the discriminator along with a 512-dimensional visual feature vector. We trained our approach and baselines for 200 million frames using the intrinsic reward and measure performance by the extrinsic reward throughout learning. (The audio-feature pipeline is sketched immediately below.) |
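The quoted setup describes the audio feature computation concretely enough to sketch. Below is a minimal NumPy rendering of that pipeline, not the authors' code: only the FFT step and the 512-dimensional max-pooled output come from the paper, while the pooling-window arithmetic and the 44.1 kHz sample rate are our assumptions.

```python
import numpy as np

def audio_features(clip: np.ndarray, feature_dim: int = 512) -> np.ndarray:
    """FFT an audio clip, then max-pool the magnitude spectrum down to
    `feature_dim` values, per the paper's description. The pooling
    window computation is our assumption."""
    spectrum = np.abs(np.fft.rfft(clip))  # magnitude spectrum of the clip
    # Pad so the spectrum length is a multiple of feature_dim, then
    # max-pool each consecutive window down to a single value.
    window = int(np.ceil(len(spectrum) / feature_dim))
    padded = np.pad(spectrum, (0, window * feature_dim - len(spectrum)))
    return padded.reshape(feature_dim, window).max(axis=1)

# Example: 4 time steps at 60 FPS is 1/15 s of audio; at an assumed
# 44.1 kHz sample rate that is 2940 samples.
clip = np.random.randn(2940)
feat = audio_features(clip)
assert feat.shape == (512,)
```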
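Since the paper contains no pseudocode, the following is a hedged sketch of the audio-visual association mechanism as the paper describes it: a discriminator receives a 512-dimensional visual feature and a 512-dimensional audio feature and learns to tell aligned pairs from mismatched ones, and its error on observed pairs serves as the intrinsic reward. The network sizes, the use of shuffled audio as negatives, and the exact reward form (binary cross-entropy on the true pair) are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVDiscriminator(nn.Module):
    """Predicts whether a (visual, audio) feature pair is aligned.
    Hidden size and depth are assumptions."""
    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, visual, audio):
        return self.net(torch.cat([visual, audio], dim=-1)).squeeze(-1)

def intrinsic_reward(disc, visual, audio):
    """Curiosity signal in this sketch: the discriminator's surprise
    (binary cross-entropy) on the true, aligned pair. Novel
    associations are poorly predicted and so yield a high reward."""
    with torch.no_grad():
        logits = disc(visual, audio)
        return F.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits), reduction="none")

# Training-step sketch: real pairs labeled 1, shuffled (mismatched)
# audio labeled 0, over one 128-step trajectory chunk.
disc = AVDiscriminator()
visual = torch.randn(128, 512)
audio = torch.randn(128, 512)
mismatched = audio[torch.randperm(128)]
logits = torch.cat([disc(visual, audio), disc(visual, mismatched)])
labels = torch.cat([torch.ones(128), torch.zeros(128)])
loss = F.binary_cross_entropy_with_logits(logits, labels)
reward = intrinsic_reward(disc, visual, audio)  # shape: (128,)
```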