CATER: A diagnostic dataset for Compositional Actions & TEmporal Reasoning

Authors: Rohit Girdhar, Deva Ramanan

Venue: ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use CATER to benchmark state-of-the-art video understanding models (Wang et al., 2018; 2016b; Hochreiter & Schmidhuber, 1997), and show even the best models struggle on our dataset. We also uncover some insights into the behavior of these models by changing parameters such as the temporal duration of an occlusion, the degree of camera motion, etc., which are difficult to both tune and label in real-world video data.
Researcher Affiliation | Collaboration | Rohit Girdhar (Carnegie Mellon University; now at Facebook AI Research), Deva Ramanan (Carnegie Mellon University and Argo AI)
Pseudocode | No | No pseudocode or algorithm blocks are found in the paper.
Open Source Code | Yes | http://rohitgirdhar.github.io/CATER; the paper also notes: "With the code release we also provide a further split of train set into a validation set (80:20)."
Open Datasets | Yes | Our dataset, named CATER, is rendered synthetically using a library of standard 3D objects, and tests the ability to recognize compositions of object movements that require long-term reasoning. In addition to being a challenging dataset, CATER also provides a plethora of diagnostic tools to analyze modern spatiotemporal video architectures by being completely observable and controllable. http://rohitgirdhar.github.io/CATER
Dataset Splits | Yes | We split the data randomly in a 70:30 ratio into a training and test set. We similarly render a same-size dataset with camera motion, and define tasks and splits in the same way as for the static camera. With the code release we also provide a further split of the train set into a validation set (80:20). (A minimal split sketch is given after this table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or processor types used for running the experiments.
Software Dependencies | No | The paper mentions using implementations of R3D, non-local blocks, TSN, and TVL1 for optical flow, but does not provide specific software names with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For (Wang et al., 2018), all the models are based on the ResNet-50 base architecture, and trained with hyperparameters scaled down from Kinetics as per CATER size. For non-local (NL) experiments, we replace the conv3 and conv4 blocks in ResNet with the NL blocks. All models are trained with classification loss implemented using sigmoid cross-entropy for Tasks 1 and 2 (multi-label classification) and softmax cross-entropy for Task 3. At test time, we split the video into 10 temporal clips and 3 spatial clips. When aggregating using average pooling, we average the predictions from all 30 clips. For LSTM, we train and test on the 10 center clips. We experiment with varying the number of frames (#frames) and sampling rate (SR). (A sketch of the loss setup and clip aggregation follows after this table.)
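
As a companion to the "Dataset Splits" row, here is a minimal sketch of the described splits: a random 70:30 train/test split, followed by an 80:20 train/validation split of the training portion. The video count and file naming below are placeholders, not taken from the CATER release.

```python
# Minimal sketch of the splits described above: 70:30 train/test,
# then 80:20 train/val within the training portion.
# The video count (5500) and file naming are placeholders, not from CATER's code.
import random

random.seed(0)  # fix the shuffle so the split is reproducible
video_ids = [f"video_{i:06d}.avi" for i in range(5500)]  # hypothetical IDs
random.shuffle(video_ids)

# 70:30 split into train and test.
n_train = int(0.7 * len(video_ids))
train_ids, test_ids = video_ids[:n_train], video_ids[n_train:]

# Further 80:20 split of the training set into train and validation.
n_tr = int(0.8 * len(train_ids))
train_split, val_split = train_ids[:n_tr], train_ids[n_tr:]

print(len(train_split), len(val_split), len(test_ids))  # 3080 770 1650 for 5500 videos
```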
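
The "Experiment Setup" row describes sigmoid cross-entropy for the multi-label Tasks 1 and 2, softmax cross-entropy for Task 3, and test-time average pooling over 10 temporal x 3 spatial clips. The sketch below illustrates that setup in PyTorch; the class counts, tensor shapes, and function names are illustrative assumptions rather than the authors' released code.

```python
# Sketch of the loss setup and test-time clip aggregation described above.
# Class counts, tensor shapes, and `model` are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CLASSES_MULTILABEL = 14  # placeholder count for the multi-label tasks
NUM_CLASSES_TASK3 = 36       # placeholder count for the single-label task

# Tasks 1 & 2: multi-label classification -> sigmoid cross-entropy.
multilabel_criterion = nn.BCEWithLogitsLoss()
# Task 3: single-label classification -> softmax cross-entropy.
task3_criterion = nn.CrossEntropyLoss()

def train_step(model, clips, targets, criterion, optimizer):
    """One training step: clips (B, C, T, H, W); targets match the task's loss."""
    logits = model(clips)
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict_video(model, clips):
    """Average-pool predictions over the 30 test clips of one video
    (10 temporal x 3 spatial crops), as in the aggregation described above."""
    logits = model(clips)      # clips: (30, C, T, H, W) -> logits: (30, num_classes)
    return logits.mean(dim=0)  # average pooling over the 30 clips
```

For the LSTM variant mentioned in the row, the mean over clips would be replaced by a recurrent aggregation over the 10 center clips.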