Hopper: Multi-hop Transformer for Spatiotemporal Reasoning
Authors: Honglu Zhou, Asim Kadav, Farley Lai, Alexandru Niculescu-Mizil, Martin Renqiang Min, Mubbasir Kapadia, Hans Peter Graf
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate on the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy using just 1 FPS by hopping through just a few critical frames. We also demonstrate Hopper can perform long-term reasoning by building a CATER-h dataset that requires multi-step reasoning to localize objects of interest correctly. |
| Researcher Affiliation | Collaboration | Honglu Zhou¹, Asim Kadav², Farley Lai², Alexandru Niculescu-Mizil², Martin Renqiang Min², Mubbasir Kapadia¹, Hans Peter Graf²; ¹Department of Computer Science, Rutgers University, Piscataway, NJ, USA; ²NEC Laboratories America, Inc., San Jose, CA, USA |
| Pseudocode | Yes | The overall module is described in Algorithm 1. MHT accepts a frame track T_f = [i_1, i_2, …, i_T], an object track T_o = [o_1^1, o_2^1, …, o_T^1, …, o_1^N, o_2^N, …, o_T^N], an initial target video query embedding E, the number of objects N, and the number of frames T. In Algorithm 1, h denotes the hop index, and t is the frame index that the previous hop (i.e., iteration) mostly attended to. (A hedged sketch of this interface follows the table.) |
| Open Source Code | No | The paper links to a dataset repository (https://github.com/necla-ml/cater-h) and mentions using other authors' implementations, but it does not provide an explicit statement or link to the open-source code for the methodology described in this paper. |
| Open Datasets | Yes | We evaluate on the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy using just 1 FPS by hopping through just a few critical frames. We also demonstrate Hopper can perform long-term reasoning by building a CATER-h dataset (https://github.com/necla-ml/cater-h) that requires multi-step reasoning to localize objects of interest correctly. |
| Dataset Splits | No | The paper states 'We split the data randomly in 70:30 ratio into a training and test set, resulting in 5,624 training samples and 1,456 testing samples.' While it mentions a validation set implicitly (e.g., 'when validation loss saturates'), it does not explicitly provide the split proportions or sample counts for the validation set. (A minimal split sketch follows the table.) |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments, such as GPU or CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions several software components and frameworks used (e.g., DETR, Adam optimizer, TSM, TPN, SINet implementations), but it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | The initial learning rate was set to 10⁻⁴ and weight decay to 10⁻³. The batch size was 16. The number of attention heads for DETR was set to 8 and for the Multi-hop Transformer to 2. The Transformer dropout rate was set to 0.1. (A configuration sketch follows the table.) |
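
To make the interface quoted in the Pseudocode row concrete, below is a minimal PyTorch sketch of a multi-hop decoding loop over a frame track, an object track, and a target video query embedding. The class name `MultiHopSketch`, the reuse of a single `nn.TransformerDecoderLayer`, the number of hops, and the similarity-based rule for picking the most-attended frame are all assumptions for illustration; this is not the paper's Algorithm 1, which is not quoted here in full.

```python
import torch
import torch.nn as nn

class MultiHopSketch(nn.Module):
    """Illustrative multi-hop decoding loop; NOT the authors' Algorithm 1."""

    def __init__(self, d_model=256, nhead=2, num_hops=3):
        super().__init__()
        # a single decoder layer reused at every hop (an assumption for brevity)
        self.hop_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.num_hops = num_hops

    def forward(self, frame_track, object_track, query):
        # frame_track:  (B, T, d)    frame embeddings i_1..i_T
        # object_track: (B, N*T, d)  object embeddings o_t^n flattened over objects/frames
        # query:        (B, 1, d)    initial target video query embedding E
        memory = torch.cat([frame_track, object_track], dim=1)
        t_prev = None  # frame index the previous hop mostly attended to
        for h in range(self.num_hops):
            query = self.hop_layer(query, memory)
            # illustrative proxy for the "most attended" frame: similarity between
            # the updated query and each frame embedding
            sim = torch.einsum('bqd,btd->bt', query, frame_track)
            t_prev = sim.argmax(dim=-1)
        return query, t_prev


# usage with random tensors: batch of 2 videos, T = 80 frames, N = 10 objects, d = 256
frames = torch.randn(2, 80, 256)
objects = torch.randn(2, 10 * 80, 256)
query = torch.randn(2, 1, 256)
out, t_last = MultiHopSketch()(frames, objects, query)
```

The returned `t_last` plays the role of t in the quoted description, i.e., the frame index the final hop mostly attended to; how the real model uses t to constrain the next hop is specified in the paper, not here.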
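
The 70:30 split quoted in the Dataset Splits row can be sketched as follows. `sample_ids` is a hypothetical placeholder for the CATER-h video identifiers, and the seed and resulting counts are illustrative rather than the authors' exact split.

```python
import random

# placeholder identifiers; 5,624 + 1,456 = 7,080 samples are reported, but the
# actual videos and the authors' split are not reproduced here
sample_ids = [f"video_{i:06d}" for i in range(7080)]
random.seed(0)  # seed is an assumption; the quoted text does not state one
random.shuffle(sample_ids)
cut = int(0.7 * len(sample_ids))
train_ids, test_ids = sample_ids[:cut], sample_ids[cut:]
```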
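
The settings quoted in the Experiment Setup row translate into roughly the following PyTorch configuration; the Adam optimizer is mentioned in the Software Dependencies row, while `model` here is a placeholder standing in for the Hopper network, and the learning-rate schedule is not specified in the quoted text.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder module standing in for the Hopper network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)

batch_size = 16
detr_attention_heads = 8      # attention heads for DETR
multihop_attention_heads = 2  # attention heads for the Multi-hop Transformer
transformer_dropout = 0.1     # Transformer dropout rate
```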