AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

Authors: Qi Zhao, Shijie Wang, Ce Zhang, Changcheng Fu, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose AntGPT, which represents video observations as sequences of human actions, and uses the action representation for an LLM to infer the goals and model temporal dynamics. AntGPT achieves state-of-the-art performance on Ego4D LTA v1 and v2, EPIC-Kitchens-55, as well as EGTEA GAZE+, thanks to the LLM's goal inference and temporal dynamics modeling capabilities. We conduct experiments on multiple LTA benchmarks, including Ego4D (Grauman et al., 2022), EPIC-Kitchens-55 (Damen et al., 2020), and EGTEA GAZE+ (Li et al., 2018). (A hypothetical prompt sketch illustrating this action-to-LLM representation follows the table.)
Researcher Affiliation | Collaboration | Qi Zhao (Brown University), Shijie Wang (Brown University), Ce Zhang (Brown University), Changcheng Fu (Brown University), Minh Quan Do (Brown University), Nakul Agarwal (Honda Research Institute), Kwonjoon Lee (Honda Research Institute), Chen Sun (Brown University)
Pseudocode | No | The paper describes its methods in prose and uses diagrams but does not contain any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code and model are available at brown-palm.github.io/AntGPT.
Open Datasets | Yes | Ego4D (Grauman et al., 2022) ... EPIC-Kitchens-55 (Damen et al., 2020) (EK-55) ... EGTEA Gaze+ (Li et al., 2018) (EGTEA)
Dataset Splits | Yes | We follow the datasets' standard splits. We adopt the train and test splits from Nagarajan et al. (2020). All results are reported on the validation set. All hyper-parameters are chosen by minimizing the loss on the validation set.
Hardware Specification | Yes | All experiments are conducted on NVIDIA A6000 GPUs.
Software Dependencies | No | The paper mentions several software components and models, such as CLIP, Llama 2, GPT-3.5, and PEFT with LoRA, but it does not specify concrete version numbers for the programming languages, libraries, or frameworks used (e.g., Python, PyTorch, CUDA versions), which are crucial for exact reproducibility.
Experiment Setup | Yes | We use a vanilla Transformer encoder with 3 layers, 8 heads, and a hidden representation dimension of 2048. We use Nesterov momentum SGD + cosine annealing scheduler with learning rate 5e-4, and train the model for 30 epochs with the first 4 as warm-up epochs. For the recognition model, we use Nesterov momentum SGD + cosine annealing scheduler with learning rate 1e-3, and train for 40 epochs with 4 warm-up epochs. For EK-55, we use the Adam optimizer with learning rate 5e-5. For Gaze, we use Nesterov momentum SGD + cosine annealing scheduler with learning rate 2e-2. (A hypothetical training-setup sketch follows the table.)
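
As a concrete illustration of the action-sequence-to-LLM representation noted in the Research Type row, below is a minimal, hypothetical Python sketch. It assumes the observed actions have already been produced by a recognition model; the prompt wording and the query_llm helper are our assumptions, not the authors' exact implementation.

def build_prompt(observed_actions, num_future=20):
    # Serialize recognized actions into text and ask the LLM to infer the
    # goal before predicting the future action sequence (verb-noun pairs).
    history = ", ".join(observed_actions)
    return (
        f"Observed actions: {history}.\n"
        f"First infer the actor's goal, then predict the next {num_future} "
        f"future actions as a comma-separated list of verb-noun pairs."
    )

prompt = build_prompt(["open fridge", "take egg", "crack egg"])
# response = query_llm(prompt)  # hypothetical call to GPT-3.5 or a fine-tuned Llama 2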
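
The Experiment Setup row can also be read as a training configuration. The following PyTorch sketch instantiates the stated transformer and Ego4D optimizer settings; the momentum value (0.9), the batch_first layout, and the linear warm-up wrapper are assumptions not given in the paper.

import torch
import torch.nn as nn

# 3-layer, 8-head vanilla Transformer encoder with hidden dimension 2048.
encoder_layer = nn.TransformerEncoderLayer(d_model=2048, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

# Nesterov momentum SGD with learning rate 5e-4 (momentum value assumed).
optimizer = torch.optim.SGD(encoder.parameters(), lr=5e-4, momentum=0.9, nesterov=True)

# 30 epochs total, the first 4 as warm-up, then cosine annealing.
warmup_epochs, total_epochs = 4, 30
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs]
)

for epoch in range(total_epochs):
    # ... one training epoch over the action sequences ...
    scheduler.step()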