AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
Authors: Qi Zhao, Shijie Wang, Ce Zhang, Changcheng Fu, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose AntGPT, which represents video observations as sequences of human actions, and uses the action representation for an LLM to infer the goals and model temporal dynamics. AntGPT achieves state-of-the-art performance on Ego4D LTA v1 and v2, EPIC-Kitchens-55, as well as EGTEA GAZE+, thanks to LLMs' goal inference and temporal dynamics modeling capabilities. We conduct experiments on multiple LTA benchmarks, including Ego4D (Grauman et al., 2022), EPIC-Kitchens-55 (Damen et al., 2020), and EGTEA GAZE+ (Li et al., 2018). (A prompt-formatting sketch of this action-to-text pipeline follows the table.) |
| Researcher Affiliation | Collaboration | Qi Zhao (Brown University); Shijie Wang (Brown University); Ce Zhang (Brown University); Changcheng Fu (Brown University); Minh Quan Do (Brown University); Nakul Agarwal (Honda Research Institute); Kwonjoon Lee (Honda Research Institute); Chen Sun (Brown University) |
| Pseudocode | No | The paper describes its methods in prose and uses diagrams but does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and model are available at brown-palm.github.io/AntGPT. |
| Open Datasets | Yes | Ego4D (Grauman et al., 2022); EPIC-Kitchens-55 (Damen et al., 2020) (EK-55); EGTEA Gaze+ (Li et al., 2018) (EGTEA) |
| Dataset Splits | Yes | We follow the datasets' standard splits. We adopt the train and test splits from Nagarajan et al. (2020). All results are reported on the validation set. All hyper-parameters are chosen by minimizing the loss on the validation set. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A6000 GPUs. |
| Software Dependencies | No | The paper mentions several software components and models, such as CLIP, Llama2, GPT-3.5, and PEFT with LoRA, but it does not specify concrete version numbers for the programming languages, libraries, or frameworks used (e.g., Python, PyTorch, CUDA versions), which are crucial for exact reproducibility. (A LoRA configuration sketch follows the table.) |
| Experiment Setup | Yes | We use a vanilla Transformer encoder with 3 layers, 8 heads, and a hidden representation dimension of 2048. We use Nesterov momentum SGD with a cosine annealing scheduler and learning rate 5e-4, training the model for 30 epochs with the first 4 as warm-up epochs. For the recognition model, we use Nesterov momentum SGD with a cosine annealing scheduler and learning rate 1e-3, training for 40 epochs with 4 warm-up epochs. For EK-55, we use the Adam optimizer with learning rate 5e-5. For Gaze, we use Nesterov momentum SGD with a cosine annealing scheduler and learning rate 2e-2. (A training-configuration sketch follows the table.) |
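As referenced in the Research Type row, AntGPT's core idea is to represent video observations as sequences of recognized actions and hand them to an LLM for goal inference and future-action prediction. The sketch below is a hedged illustration of that action-to-text step, not the authors' released code: the prompt wording, the helper names `format_action_prompt` and `infer_future_actions`, and the choice of the OpenAI chat API with `gpt-3.5-turbo` are assumptions (consistent with the paper's mention of GPT-3.5).

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def format_action_prompt(observed_actions, num_future=20):
    """Turn recognized (verb, noun) action pairs into a text prompt.

    `observed_actions` is a list like [("open", "fridge"), ("take", "milk")].
    The prompt wording is an illustrative assumption, not the paper's
    released prompt.
    """
    history = ", ".join(f"{verb} {noun}" for verb, noun in observed_actions)
    return (
        f"A person performed the following actions in order: {history}. "
        f"First, infer the person's goal. Then predict the next "
        f"{num_future} actions as comma-separated 'verb noun' pairs."
    )


def infer_future_actions(observed_actions, num_future=20):
    """Query the LLM for the inferred goal and predicted future actions."""
    prompt = format_action_prompt(observed_actions, num_future)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the paper reports using GPT-3.5
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic decoding for evaluation
    )
    return response.choices[0].message.content
```

The returned text would still need to be parsed back into the benchmark's verb/noun vocabulary before computing edit-distance metrics; that mapping step is omitted here.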
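The Software Dependencies row notes that the paper uses PEFT with LoRA to fine-tune Llama2 but reports no versions or hyperparameters. The following is a minimal sketch of such a setup using the Hugging Face `peft` and `transformers` libraries; the rank, alpha, target modules, and checkpoint name are illustrative assumptions, not values from the paper.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model checkpoint is an assumption; the paper only names "Llama2".
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # assumed LoRA rank
    lora_alpha=16,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # common choice for Llama-style attention
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # shows only the LoRA adapters are trained
```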
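The Experiment Setup row reads almost directly as a PyTorch configuration. The sketch below wires the reported values together (3 layers, 8 heads, hidden dimension 2048, Nesterov momentum SGD at 5e-4, cosine annealing, 30 epochs with 4 warm-up); the momentum value, the linear-warmup implementation, and all variable names are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

# Encoder matching the reported architecture: 3 layers, 8 heads, d_model=2048.
encoder_layer = nn.TransformerEncoderLayer(d_model=2048, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

# Nesterov momentum SGD at lr=5e-4; the momentum value itself is an assumption.
optimizer = torch.optim.SGD(encoder.parameters(), lr=5e-4, momentum=0.9, nesterov=True)

total_epochs, warmup_epochs = 30, 4

# Linear warm-up for the first 4 epochs, then cosine annealing for the rest.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=warmup_epochs
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_epochs - warmup_epochs
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs]
)

for epoch in range(total_epochs):
    # ... per-batch forward/backward passes and optimizer.step() go here ...
    scheduler.step()  # advance the warm-up/cosine schedule once per epoch
```

The recognition model, EK-55, and Gaze configurations from the same row follow the identical pattern with their respective optimizers and learning rates (1e-3, Adam at 5e-5, and 2e-2).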