Intra-agent speech permits zero-shot task acquisition

Authors: Chen Yan, Federico Carnevale, Petko I Georgiev, Adam Santoro, Aurelia Guy, Alistair Muldal, Chia-Chun Hung, Joshua Abramson, Timothy Lillicrap, Gregory Wayne

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then experimentally compute scaling curves over different amounts of labeled data and compare the data efficiency against a supervised learning baseline. Finally, we incorporate intra-agent speech into an embodied, mobile manipulator agent operating in a 3D virtual world, and show that with as few as 150 additional image captions, intra-agent speech endows the agent with the ability to manipulate and answer questions about a new object without any related task-directed experience (zero-shot). (A hedged sketch of this scaling-curve protocol appears below the table.)
Researcher Affiliation | Industry | Chen Yan, DeepMind, London, UK (ywc@deepmind.com); Federico Carnevale, DeepMind, London, UK (fedecarnev@deepmind.com); Petko Georgiev, DeepMind, London, UK (petkoig@deepmind.com); Adam Santoro, DeepMind, London, UK (adamsantoro@deepmind.com); Aurelia Guy, OpenAI, San Francisco, USA (7aureliaguy@gmail.com); Alistair Muldal, DeepMind, London, UK (alimuldal@deepmind.com); Chia-Chun Hung, Isomorphic Labs, London, UK (aldenhung@google.com); Josh Abramson, DeepMind, London, UK (jabramson@deepmind.com); Timothy Lillicrap, DeepMind, London, UK (countzero@deepmind.com); Gregory Wayne, DeepMind, London, UK (gregwayne@deepmind.com)
Pseudocode | No | The paper describes algorithms but does not provide pseudocode or a clearly labeled algorithm block.
Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] The code and the data are proprietary.
Open Datasets | No | As mentioned, the domain in which we tested our methods is called the Playhouse, which originated in Interactive Agents Team [22]. The authors compiled approximately three years' worth of interaction data, comprising about 2 billion image frames. The authors have confirmed no personally identifiable information or offensive content is contained in the dataset and gave consent for us to access it. (...) Consent for using the data was obtained from Interactive Agents Team [22] under proprietary contract. See Section 3.1.
Dataset Splits | No | We measured the log probability of captions in the validation set, CIDEr score [30], object precision and color-object pair precision. (...) For the paired dataset, we engaged with crowd raters to provide corresponding captions for 78K uniformly sampled images from the unpaired dataset. The paper mentions the use of a 'validation set' and evaluation on 'validation data', but does not provide specific percentages or counts for how the dataset was split into training, validation, and test sets. (An illustrative split sketch appears below the table.)
Hardware Specification | Yes | We trained our models using Tensor Processing Units (TPUv3) [28].
Software Dependencies | No | The paper mentions software components and architectures like ResNet, Transformer, VQ-VAE, Adam optimizer, and SentencePiece, but does not provide specific version numbers for these or for programming languages/libraries (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The image-conditional language encoder received input images with resolution 96 × 72. We used the ResNet architecture described in [22], with strides (1, 1, 2, 2), 3 × 3 kernel size, and channel sizes (64, 64, 128, 256) for a total of 20 layers. Language was produced by a 4-layer causal transformer with 256 embedding size and 4 attention heads [23], which attended to the 432 hyper-pixel vectors generated from the ResNet, and produced a sequence of logits corresponding to a 4,000-token vocabulary. (...) The loss was optimized by the Adam optimizer [27] with β1 = 0.9, β2 = 0.999 and a learning rate of 2 × 10⁻⁴, with early stopping at the lowest log-likelihood over captions in validation data. We trained all models with a batch size of 128, except for the contrastive classifier model, which also received 2,048 unlabeled images per batch, giving a total batch size of 2,176. (A hedged configuration sketch appears below the table.)
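
The Research Type row mentions computing scaling curves over different amounts of labeled data and comparing against a supervised baseline. The sketch below only illustrates that protocol under assumed names: `train_and_evaluate`, the method labels, and the caption budgets are hypothetical placeholders, not the authors' code or actual experiment sizes.

```python
# Illustrative sketch of a scaling-curve comparison; not the authors' code.
# `train_and_evaluate` is a hypothetical stand-in that trains one variant
# (e.g. intra-agent speech vs. a supervised baseline) on `n` labeled captions
# and returns an evaluation score.
from typing import Callable, Dict, List

def scaling_curve(
    train_and_evaluate: Callable[[str, int], float],
    methods: List[str],
    caption_budgets: List[int],
) -> Dict[str, List[float]]:
    """Evaluate each method at each labeled-data budget and collect the curves."""
    return {
        method: [train_and_evaluate(method, n) for n in caption_budgets]
        for method in methods
    }

if __name__ == "__main__":
    # Dummy evaluator so the sketch runs end to end; real scores would come from training.
    fake_eval = lambda method, n: (0.5 if method == "supervised" else 0.7) * n / (n + 1000)
    curves = scaling_curve(fake_eval, ["supervised", "intra_agent_speech"], [150, 1000, 10000, 78000])
    print(curves)
```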
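
The Dataset Splits row notes that the paper reports a 78K-image paired caption dataset and a held-out validation set, but no split proportions. The sketch below shows one deterministic way such a split could be produced; the 90/10 ratio, the seed, and the function name `split_indices` are assumptions for illustration, not values from the paper.

```python
# Illustrative only: the split ratio below is an assumption, since the paper
# does not report how the 78K paired captions were divided.
import random

def split_indices(num_examples: int, val_fraction: float = 0.1, seed: int = 0):
    """Return (train_indices, val_indices) as disjoint, reproducible index lists."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_val = int(num_examples * val_fraction)
    return indices[num_val:], indices[:num_val]

train_idx, val_idx = split_indices(78_000)
print(len(train_idx), len(val_idx))  # 70200 7800 under the assumed 90/10 split
```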
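
The Experiment Setup row quotes concrete architecture and optimizer hyperparameters. The configuration sketch below simply collects those reported numbers into dataclasses; the class and field names (`EncoderConfig`, `TrainConfig`) are illustrative assumptions, and the paper does not specify the framework or code layout it actually used.

```python
# Hedged sketch: hyperparameter values are taken from the quoted setup, but the
# structure and names are illustrative assumptions, not the authors' code.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class EncoderConfig:
    # Image-conditional language encoder, per the quoted description.
    image_resolution: Tuple[int, int] = (96, 72)      # quoted as 96 × 72
    resnet_strides: Tuple[int, ...] = (1, 1, 2, 2)
    resnet_channels: Tuple[int, ...] = (64, 64, 128, 256)
    resnet_kernel_size: int = 3                        # 3 × 3 kernels, 20 layers total
    transformer_layers: int = 4                        # causal transformer
    embedding_size: int = 256
    attention_heads: int = 4
    num_hyper_pixels: int = 432                        # ResNet outputs attended over
    vocab_size: int = 4000

@dataclass
class TrainConfig:
    learning_rate: float = 2e-4                        # Adam, early stopping on validation log-likelihood
    adam_betas: Tuple[float, float] = (0.9, 0.999)
    batch_size: int = 128
    contrastive_unlabeled_per_batch: int = 2048        # only for the contrastive classifier model

if __name__ == "__main__":
    cfg = TrainConfig()
    # Matches the reported total batch size of 2,176 for the contrastive classifier.
    print(cfg.batch_size + cfg.contrastive_unlabeled_per_batch)
```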