See, Hear, Explore: Curiosity via Audio-Visual Association

Authors: Victoria Dean, Shubham Tulsiani, Abhinav Gupta

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present results on several Atari environments and Habitat (a photorealistic navigation simulator), showing the benefits of using an audio-visual association model for intrinsically guiding learning agents in the absence of external rewards. |
| Researcher Affiliation | Collaboration | Victoria Dean (Carnegie Mellon University, vdean@cs.cmu.edu); Shubham Tulsiani (Facebook AI Research, shubhtuls@fb.com); Abhinav Gupta (Carnegie Mellon University and Facebook AI Research, abhinavg@cs.cmu.edu) |
| Pseudocode | No | The paper describes the methodology in text and with diagrams (e.g., Figure 2) but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | For videos and code, see https://vdean.github.io/audio-curiosity.html. |
| Open Datasets | No | The paper uses Atari games and the Habitat simulator as training environments. While these environments are generally accessible, the paper does not provide links, DOIs, or formal citations for a static dataset with explicit public-access information. |
| Dataset Splits | No | The paper describes training agents within the Atari and Habitat environments for a specified number of frames but does not mention explicit training, validation, or test splits of a static dataset. |
| Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU or CPU models or memory. |
| Software Dependencies | No | The paper mentions software such as Gym Retro and Habitat and algorithms such as PPO, but does not provide version numbers for these or other software components used in the experiments. |
| Experiment Setup | Yes | During training, the agent policy is rolled out in parallel environments. These yield trajectories which are each chunked into 128 time steps. [...] To compute audio features, we take an audio clip spanning 4 time steps (1/15th of a second for these 60 frame per second environments) and apply a Fast Fourier Transform (FFT). The FFT output is downsampled using max pooling to a 512-dimensional feature vector, which is used as input to the discriminator along with a 512-dimensional visual feature vector. We trained our approach and baselines for 200 million frames using the intrinsic reward and measure performance by the extrinsic reward throughout learning. |
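
For readers who want a concrete picture of the setup quoted in the last row, the Python sketch below illustrates one way the audio-visual curiosity signal could be computed: the FFT of a short audio clip is max-pooled to a 512-dimensional feature and paired with a visual feature in an association discriminator whose error serves as the intrinsic reward. The sample rate, network sizes, and the exact reward formula (cross-entropy of the true-pair label) are assumptions of this sketch, not details confirmed by the paper.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


def audio_feature(clip, feature_dim=512):
    """Max-pooled FFT magnitude feature for one audio clip.

    `clip` is a 1-D waveform covering 4 game time steps (~1/15 s);
    the sample rate (and hence clip length) is an assumption here and
    must yield at least `feature_dim` spectral bins.
    """
    spectrum = np.abs(np.fft.rfft(clip))
    pool = len(spectrum) // feature_dim  # spectral bins per pooling window
    pooled = spectrum[: pool * feature_dim].reshape(feature_dim, pool).max(axis=1)
    return torch.from_numpy(pooled).float()


class AVDiscriminator(nn.Module):
    """Scores whether a (visual, audio) feature pair is truly associated.

    Layer sizes are illustrative, not taken from the paper.
    """

    def __init__(self, feat_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, visual_feat, audio_feat):
        return self.net(torch.cat([visual_feat, audio_feat], dim=-1))


def intrinsic_reward(disc, visual_feat, audio_feat):
    """Curiosity reward: the discriminator's error on the true pairing.

    Unfamiliar audio-visual associations are misclassified more often and
    therefore yield a larger reward. Using the cross-entropy of the
    'associated' label is one plausible formulation, assumed here.
    """
    with torch.no_grad():
        logit = disc(visual_feat, audio_feat)
        return F.binary_cross_entropy_with_logits(
            logit, torch.ones_like(logit), reduction="none"
        ).squeeze(-1)
```

In a full training loop, the discriminator would presumably also be trained against mismatched (visual, audio) pairs as negatives, and the per-step reward above would stand in for the extrinsic reward fed to PPO; those pieces are omitted here for brevity.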