Hopper: Multi-hop Transformer for Spatiotemporal Reasoning

Authors: Honglu Zhou, Asim Kadav, Farley Lai, Alexandru Niculescu-Mizil, Martin Renqiang Min, Mubbasir Kapadia, Hans Peter Graf

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate over the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy using just 1 FPS by hopping through just a few critical frames. We also demonstrate that Hopper can perform long-term reasoning by building a CATER-h dataset that requires multi-step reasoning to localize objects of interest correctly.
Researcher Affiliation | Collaboration | Honglu Zhou^1, Asim Kadav^2, Farley Lai^2, Alexandru Niculescu-Mizil^2, Martin Renqiang Min^2, Mubbasir Kapadia^1, Hans Peter Graf^2. ^1 Department of Computer Science, Rutgers University, Piscataway, NJ, USA; ^2 NEC Laboratories America, Inc., San Jose, CA, USA
Pseudocode | Yes | The overall module is described in Algorithm 1. MHT accepts a frame track T_f = [i_1, i_2, ..., i_T], an object track T_o = [o_1^1, o_2^1, ..., o_T^1, ..., o_1^N, o_2^N, ..., o_T^N], an initial target video query embedding E, the number of objects N, and the number of frames T. In Algorithm 1, h denotes the hop index and t is the frame index that the previous hop (i.e., iteration) mostly attended to. (A hedged sketch of this hopping loop appears after the table.)
Open Source Code | No | The paper links to a dataset repository (https://github.com/necla-ml/cater-h) and mentions using other authors' implementations, but it does not provide an explicit statement or link to open-source code for the methodology described in this paper.
Open Datasets | Yes | We evaluate over the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy using just 1 FPS by hopping through just a few critical frames. We also demonstrate that Hopper can perform long-term reasoning by building a CATER-h dataset that requires multi-step reasoning to localize objects of interest correctly. Footnote 1: https://github.com/necla-ml/cater-h
Dataset Splits | No | The paper states 'We split the data randomly in 70:30 ratio into a training and test set, resulting in 5,624 training samples and 1,456 testing samples.' While a validation set is mentioned implicitly (e.g., 'when validation loss saturates'), the paper does not explicitly give the proportion or sample count of a validation split. (A minimal split sketch appears after the table.)
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU or CPU models or cloud computing instance types.
Software Dependencies | No | The paper mentions several software components and frameworks used (e.g., DETR, the Adam optimizer, TSM, TPN, and SINet implementations), but it does not provide version numbers for any of these dependencies.
Experiment Setup | Yes | The initial learning rate was set to 10^-4 and weight decay to 10^-3. The batch size was 16. The number of attention heads was set to 8 for DETR and 2 for the Multi-hop Transformer. The Transformer dropout rate was set to 0.1. (A hedged training-setup sketch appears after the table.)
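
The Pseudocode row above summarizes the inputs and loop variables of Algorithm 1. The following is a minimal PyTorch sketch of that hopping loop, assuming one standard transformer decoder layer per hop, a fixed number of hops, and a dot-product approximation of "which frame the previous hop mostly attended to"; the class name, sizes, and attention approximation are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MultiHopSketch(nn.Module):
    """Minimal sketch of the multi-hop loop in Algorithm 1 (illustrative only).

    Each hop refines the target video query by attending over the concatenated
    frame track and object track; `t` tracks the frame index the previous hop
    mostly attended to.
    """

    def __init__(self, d_model=256, nhead=2, num_hops=4):
        super().__init__()
        self.hops = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_hops)
        )

    def forward(self, frame_track, object_track, query):
        # frame_track:  (B, T, d)    frame embeddings i_1..i_T
        # object_track: (B, N*T, d)  object embeddings o_1^1..o_T^N
        # query:        (B, 1, d)    initial target video query embedding E
        memory = torch.cat([frame_track, object_track], dim=1)
        t = torch.zeros(query.size(0), dtype=torch.long)  # most-attended frame per sample
        for layer in self.hops:
            query = layer(query, memory)  # one hop of cross-attention
            # Approximate "which frame this hop attended to most" with a
            # dot-product score over the frame track (assumption).
            scores = (query @ frame_track.transpose(1, 2)).softmax(dim=-1)  # (B, 1, T)
            t = scores.argmax(dim=-1).squeeze(1)
        return query, t


# Toy shapes: batch of 2 videos, T=8 frames, N=3 objects, d=256 features.
frames = torch.randn(2, 8, 256)
objects = torch.randn(2, 3 * 8, 256)
query = torch.randn(2, 1, 256)
refined_query, most_attended_frame = MultiHopSketch()(frames, objects, query)
```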
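
The Dataset Splits row quotes a random 70:30 train/test partition with 5,624 / 1,456 samples. A minimal sketch of such a partition in PyTorch, using the reported counts directly (the dataset object and seed are placeholders, not the authors' script):

```python
import torch
from torch.utils.data import TensorDataset, random_split

train_size, test_size = 5624, 1456  # sample counts reported in the quoted split
full_dataset = TensorDataset(torch.arange(train_size + test_size))  # placeholder for CATER-h samples
generator = torch.Generator().manual_seed(0)  # fixed seed so the random split is reproducible
train_set, test_set = random_split(full_dataset, [train_size, test_size], generator=generator)
```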
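
The Experiment Setup row gives the optimization hyperparameters. The sketch below wires them into a plain PyTorch training loop with the Adam optimizer mentioned under Software Dependencies; the toy features, the linear stand-in for the Hopper network, and the class count are placeholders, not the paper's pipeline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder features/labels and model head so the snippet runs end to end;
# the real pipeline trains Hopper on CATER / CATER-h videos.
num_classes = 36  # illustrative class count
dataset = TensorDataset(torch.randn(64, 256), torch.randint(0, num_classes, (64,)))
model = torch.nn.Linear(256, num_classes)  # stand-in for the full Hopper network

# Hyperparameters as reported: initial learning rate 1e-4, weight decay 1e-3, batch size 16.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)
loader = DataLoader(dataset, batch_size=16, shuffle=True)
criterion = torch.nn.CrossEntropyLoss()

for features, labels in loader:  # one pass over the toy data
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
```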