Emergent Agentic Transformer from Chain of Hindsight Experience

Authors: Hao Liu, Pieter Abbeel

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As we show on D4RL and ExoRL benchmarks, to the best of our knowledge, this is the first time that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches, even from sub-optimal data. Our Agentic Transformer also shows a promising scaling trend: bigger models consistently improve results.
Researcher Affiliation | Academia | Hao Liu, Pieter Abbeel; University of California, Berkeley
Pseudocode | Yes | Algorithm 1: Training Agentic Transformer
Open Source Code | No | The code of Agentic Transformer will be made publicly available for future research.
Open Datasets | Yes | Dataset: D4RL. In this section, we consider the continuous control tasks from the D4RL benchmark (Fu et al., 2020). Dataset: ExoRL. The ExoRL dataset is based on unlabeled exploratory data collected by running unsupervised RL algorithms. For each environment, it comes with eight different unsupervised data collection algorithms, taken from URLB (Laskin et al., 2021).
Dataset Splits | No | The paper refers to training and testing but does not explicitly detail validation splits or specific percentages for any data splits within the main text beyond mentioning the use of standard benchmarks.
Hardware Specification | Yes | Our experiments are conducted on TPUv3-32 using Jax and Flax. On 32 TPUv3, each experiment takes around 4 hours on D4RL and around 6 hours on ExoRL.
Software Dependencies | No | The paper mentions 'Jax and Flax' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Our hyperparameters on all tasks are shown below in Table 5. Table 5 lists: Number of layers: 3; Number of attention heads: 1; Embedding dimension: 128; Activation function: ReLU; Batch size: 64; Dropout: 0.1; Learning rate: 10^-4; Learning rate decay: linear warmup for 10^5 steps; Grad norm clip: 0.25; Weight decay: 10^-4; Initial desired target return at test time (D4RL); Initial desired target return at test time (ExoRL); Number of trajectories to form chain of hindsight experience during training: 4; Number of trajectories at test time: 4.
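
The Table 5 values in the Experiment Setup row can be collected into a small configuration object, which may make the training setup easier to reuse. The sketch below is illustrative only and is not the authors' code: the class name `AgenticTransformerConfig`, its field names, and the helpers `warmup_lr` and `clip_by_global_norm` are hypothetical. It also assumes that the learning rate stays constant once the linear warmup ends, which the row above does not state. The test-time target returns are omitted because no values are listed for them.

```python
from dataclasses import dataclass
from typing import List
import math


@dataclass(frozen=True)
class AgenticTransformerConfig:
    """Hypothetical container for the Table 5 values listed above."""
    num_layers: int = 3
    num_heads: int = 1
    embed_dim: int = 128
    activation: str = "relu"
    batch_size: int = 64
    dropout: float = 0.1
    learning_rate: float = 1e-4
    warmup_steps: int = 100_000      # linear warmup for 10^5 steps
    grad_norm_clip: float = 0.25
    weight_decay: float = 1e-4
    train_chain_length: int = 4      # trajectories per chain during training
    test_chain_length: int = 4       # trajectories at test time


def warmup_lr(step: int, cfg: AgenticTransformerConfig) -> float:
    """Linear warmup to cfg.learning_rate, then constant (an assumption)."""
    scale = min(1.0, (step + 1) / cfg.warmup_steps)
    return cfg.learning_rate * scale


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * (max_norm / norm) for g in grads]
```

For example, `warmup_lr(0, AgenticTransformerConfig())` gives a very small initial learning rate that grows linearly over the first 10^5 steps, and `clip_by_global_norm(grads, cfg.grad_norm_clip)` caps the gradient norm at 0.25 before an optimizer update.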