Spatially-Aware Transformers for Embodied Agents
Authors: Junmo Cho, Jaesik Yoon, Sungjin Ahn
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents an empirical evaluation of the Spatially-Aware Transformer (SAT) and Adaptive Memory Allocator (AMA) across various environments and downstream tasks. We begin by evaluating the models on prediction tasks in the Room Ballet environment. Then, we demonstrate their capability to build an action-conditioned world model and a spatially-aware image generation model. Lastly, we showcase the use of SAT-AMA for training a downstream reinforcement learning policy. |
| Researcher Affiliation | Collaboration | Junmo Cho (KAIST), Jaesik Yoon (KAIST, SAP), Sungjin Ahn (KAIST) |
| Pseudocode | Yes | We provide the pseudocode of SAT-AMA, SAT(-FIFO), and DNC training for a supervised learning task in Algorithms 1-3. |
| Open Source Code | Yes | The source code for our models and experiments will be available at https://github.com/junmokane/spatially-aware-transformer. |
| Open Datasets | Yes | on the facial images from the FFHQ dataset (Karras et al., 2019), where 62,000 face images are used for training and 7,000 images among the remaining images are used for evaluation. |
| Dataset Splits | No | The paper mentions training and evaluation sets for the FFHQ dataset (62,000 for training and 7,000 for evaluation), but it does not explicitly specify a separate validation set or percentages for a full train/validation/test split for any of its experiments. |
| Hardware Specification | Yes | Our study was performed on an Intel server equipped with 8 NVIDIA RTX 3090 GPUs and 256GB of memory. |
| Software Dependencies | No | The paper mentions software components like "Adam optimizer" and "ReLU activation function" and refers to the "OpenAI embeddings API" with the model "text-embedding-ada-002." However, it does not provide specific version numbers for broader software dependencies such as Python, PyTorch, TensorFlow, or scikit-learn libraries, which are crucial for full reproducibility. |
| Experiment Setup | Yes | For all supervised learning tasks, we used a batch size of 32, the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.0002, and gradient clipping over the range [-5.0, 5.0]. We stacked four memory layers, each of which consists of an LA block, an HCAM block, and an MLP block. We used a dimension of 128 for embedding vectors. For the LA and HCAM blocks, we used 2 heads and a head dimension of 64 for multi-head attention operations. For the MLP block, we used a 2-layer MLP with a hidden dimension of 128 and a ReLU (Nair & Hinton, 2010) activation function. For the Q network, we used a 1-layer MLP with a hidden dimension of 64 and a ReLU activation function. We also used ϵ-annealing for Q-learning (Mnih et al., 2015), starting from 1.0 and decreasing to 0.2 over 200,000 steps. |
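
To make the quoted experiment setup easier to follow, the snippet below restates it as a minimal PyTorch-style sketch. Only the hyperparameter values (batch size, learning rate, clipping range, layer dimensions, ε schedule) come from the quoted setup; the names `MLPBlock`, `epsilon_at`, and `training_step` are illustrative and not from the authors' released code, the LA and HCAM block internals are omitted, and value-based gradient clipping and a linear ε schedule are assumptions (the paper states the range and endpoints but not the exact forms).

```python
import torch
import torch.nn as nn

# Reported hyperparameters (values quoted from the setup above).
BATCH_SIZE = 32
LEARNING_RATE = 2e-4       # Adam (Kingma & Ba, 2014)
GRAD_CLIP = 5.0            # gradients clipped to [-5.0, 5.0] (value clipping assumed)
EMBED_DIM = 128            # embedding dimension
NUM_MEMORY_LAYERS = 4      # each layer: LA block + HCAM block + MLP block
NUM_HEADS = 2              # multi-head attention heads in LA and HCAM blocks
HEAD_DIM = 64              # per-head dimension
MLP_HIDDEN_DIM = 128       # hidden size of the per-layer MLP block
Q_HIDDEN_DIM = 64          # hidden size of the 1-layer MLP Q network


class MLPBlock(nn.Module):
    """Per-layer MLP block: 2-layer MLP with ReLU (Nair & Hinton, 2010)."""

    def __init__(self, dim: int = EMBED_DIM, hidden: int = MLP_HIDDEN_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def epsilon_at(step: int, start: float = 1.0, end: float = 0.2,
               anneal_steps: int = 200_000) -> float:
    """ε-annealing for Q-learning (Mnih et al., 2015): 1.0 -> 0.2 over 200k steps.
    A linear schedule is assumed; the paper only states the endpoints."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)


def training_step(model: nn.Module, optimizer: torch.optim.Optimizer,
                  loss: torch.Tensor) -> None:
    """One optimization step with the reported gradient clipping."""
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_value_(model.parameters(), GRAD_CLIP)
    optimizer.step()


# Optimizer as reported; model construction (LA and HCAM blocks) is not sketched here.
# optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```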