CATER: A diagnostic dataset for Compositional Actions & TEmporal Reasoning

Authors: Rohit Girdhar, Deva Ramanan

Venue: ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use CATER to benchmark state-of-the-art video understanding models (Wang et al., 2018; 2016b; Hochreiter & Schmidhuber, 1997), and show even the best models struggle on our dataset. We also uncover some insights into the behavior of these models by changing parameters such as the temporal duration of an occlusion, the degree of camera motion, etc., which are difficult to both tune and label in real-world video data.
Researcher Affiliation | Collaboration | Rohit Girdhar (Carnegie Mellon University; now at Facebook AI Research), Deva Ramanan (Carnegie Mellon University and Argo AI)
Pseudocode | No | No pseudocode or algorithm blocks are found in the paper.
Open Source Code | Yes | http://rohitgirdhar.github.io/CATER; the paper also notes: "With the code release we also provide a further split of train set into a validation set (80:20)."
Open Datasets | Yes | Our dataset, named CATER, is rendered synthetically using a library of standard 3D objects, and tests the ability to recognize compositions of object movements that require long-term reasoning. In addition to being a challenging dataset, CATER also provides a plethora of diagnostic tools to analyze modern spatiotemporal video architectures by being completely observable and controllable. http://rohitgirdhar.github.io/CATER
Dataset Splits | Yes | We split the data randomly in a 70:30 ratio into a training and test set. We similarly render a same-size dataset with camera motion, and define tasks and splits in the same way as for the static camera. With the code release we also provide a further split of the train set into a validation set (80:20). (A minimal split sketch is given after this table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or processor types used for running the experiments.
Software Dependencies | No | The paper mentions using implementations of R3D, non-local blocks, TSN, and TVL1 for optical flow, but does not provide specific software names with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For (Wang et al., 2018), all the models are based on the ResNet-50 base architecture, and trained with hyperparameters scaled down from Kinetics as per CATER size. For non-local (NL) experiments, we replace the conv3 and conv4 blocks in ResNet with the NL blocks. All models are trained with classification loss implemented using sigmoid cross-entropy for Tasks 1 and 2 (multi-label classification) and softmax cross-entropy for Task 3. At test time, we split the video into 10 temporal clips and 3 spatial clips. When aggregating using average pooling, we average the predictions from all 30 clips. For LSTM, we train and test on the 10 center clips. We experiment with varying the number of frames (#frames) and sampling rate (SR). (A sketch of the loss setup and clip aggregation follows after this table.)
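
As a companion to the "Dataset Splits" row, here is a minimal sketch of the described splits: a random 70:30 train/test split, followed by an 80:20 train/validation split of the training portion. The video count and file naming below are placeholders, not taken from the CATER release.

```python
# Minimal sketch of the splits described above: 70:30 train/test,
# then 80:20 train/val within the training portion.
# The video count (5500) and file naming are placeholders, not from CATER's code.
import random

random.seed(0)  # fix the shuffle so the split is reproducible
video_ids = [f"video_{i:06d}.avi" for i in range(5500)]  # hypothetical IDs
random.shuffle(video_ids)

# 70:30 split into train and test.
n_train = int(0.7 * len(video_ids))
train_ids, test_ids = video_ids[:n_train], video_ids[n_train:]

# Further 80:20 split of the training set into train and validation.
n_tr = int(0.8 * len(train_ids))
train_split, val_split = train_ids[:n_tr], train_ids[n_tr:]

print(len(train_split), len(val_split), len(test_ids))  # 3080 770 1650 for 5500 videos
```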
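
The "Experiment Setup" row describes sigmoid cross-entropy for the multi-label Tasks 1 and 2, softmax cross-entropy for Task 3, and test-time average pooling over 10 temporal x 3 spatial clips. The sketch below illustrates that setup in PyTorch; the class counts, tensor shapes, and function names are illustrative assumptions rather than the authors' released code.

```python
# Sketch of the loss setup and test-time clip aggregation described above.
# Class counts, tensor shapes, and `model` are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CLASSES_MULTILABEL = 14  # placeholder count for the multi-label tasks
NUM_CLASSES_TASK3 = 36       # placeholder count for the single-label task

# Tasks 1 & 2: multi-label classification -> sigmoid cross-entropy.
multilabel_criterion = nn.BCEWithLogitsLoss()
# Task 3: single-label classification -> softmax cross-entropy.
task3_criterion = nn.CrossEntropyLoss()

def train_step(model, clips, targets, criterion, optimizer):
    """One training step: clips (B, C, T, H, W); targets match the task's loss."""
    logits = model(clips)
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict_video(model, clips):
    """Average-pool predictions over the 30 test clips of one video
    (10 temporal x 3 spatial crops), as in the aggregation described above."""
    logits = model(clips)      # clips: (30, C, T, H, W) -> logits: (30, num_classes)
    return logits.mean(dim=0)  # average pooling over the 30 clips
```

For the LSTM variant mentioned in the row, the mean over clips would be replaced by a recurrent aggregation over the 10 center clips.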