Compositional Video Synthesis with Action Graphs
Authors: Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, Amir Globerson
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train and evaluate AG2Vid on the CATER and Something-Something V2 datasets, and show that the resulting videos have better visual quality and semantic consistency compared to baselines. |
| Researcher Affiliation | Collaboration | ¹The Blavatnik School of Computer Science, Tel Aviv University; ²UC San Diego; ³UC Berkeley; ⁴NVIDIA Research; ⁵Bar-Ilan University. |
| Pseudocode | No | The paper describes the model architecture and processes in narrative text and figures (e.g., Figure 3) but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | See the project page for code and pretrained models: https://roeiherz.github.io/AG2Video. |
| Open Datasets | Yes | We use two datasets: (1) CATER (Girdhar & Ramanan, 2020)... (2) Something-Something V2 (Goyal et al., 2017)... |
| Dataset Splits | Yes | We employ the standard CATER training partition (3,849 videos) and split the validation into 30% val (495 videos) and use the rest for testing (1,156 videos). |
| Hardware Specification | Yes | Models were trained with a batch size of 2 which was the maximal batch size to fit on a single NVIDIA V100 GPU. |
| Software Dependencies | No | The paper mentions software components and frameworks like ADAM, SPADE generator, and GCNs, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The GCN model uses K = 3 hidden layers and an embedding layer of 128 units for each object and action. For optimization we use ADAM (Kingma & Ba, 2014) with lr = 1e-4 and (β1, β2) = (0.5, 0.99). Models were trained with a batch size of 2... For loss weights we use λB = λF = λP = 10 and λA = 1. We use videos of 8 FPS and 6 FPS for CATER and Smth V2 and evaluate on videos consisting of 16 frames... |
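
The values quoted in the Dataset Splits and Experiment Setup rows can be collected in one place for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the class name `AG2VidSetup`, its field names, and the two helper functions are hypothetical, and only the numeric values come from the rows above. It also checks the arithmetic behind the "30% val" split and the clip length implied by 16 frames at each frame rate.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AG2VidSetup:
    """Hyperparameters quoted in the table; all field names are hypothetical."""
    gcn_hidden_layers: int = 3             # K = 3
    embedding_units: int = 128             # per object and per action
    learning_rate: float = 1e-4            # ADAM lr
    adam_betas: Tuple[float, float] = (0.5, 0.99)
    batch_size: int = 2                    # maximum that fit on one V100
    lambda_b: float = 10.0                 # λB
    lambda_f: float = 10.0                 # λF
    lambda_p: float = 10.0                 # λP
    lambda_a: float = 1.0                  # λA
    fps_cater: int = 8
    fps_smth_v2: int = 6
    eval_frames: int = 16


def val_fraction(n_val: int, n_test: int) -> float:
    """Share of the original validation set that is kept as validation."""
    return n_val / (n_val + n_test)


def clip_seconds(n_frames: int, fps: int) -> float:
    """Duration of an evaluation clip in seconds."""
    return n_frames / fps


cfg = AG2VidSetup()
cater_val_share = val_fraction(495, 1156)               # CATER: 495 val, 1,156 test
cater_clip = clip_seconds(cfg.eval_frames, cfg.fps_cater)
smth_clip = clip_seconds(cfg.eval_frames, cfg.fps_smth_v2)
```

With these numbers, the CATER validation share comes out at roughly 30%, matching the quoted split, and a 16-frame evaluation clip lasts 2 seconds at 8 FPS and a little under 3 seconds at 6 FPS.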