Drop-DTW: Aligning Common Signal Between Sequences While Dropping Outliers

Authors: Nikita Dvornik, Isma Hadji, Konstantinos G. Derpanis, Animesh Garg, Allan D. Jepson

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we show that Drop-DTW is a robust similarity measure for sequence retrieval and demonstrate its effectiveness as a training loss on diverse applications. With Drop-DTW, we address temporal step localization on instructional videos, representation learning from noisy videos, and cross-modal representation learning for audio-visual retrieval and localization. In all applications, we take a weakly- or unsupervised approach and demonstrate state-of-the-art results under these settings.
Researcher Affiliation | Collaboration | Nikita Dvornik (1,2), Isma Hadji (1), Konstantinos G. Derpanis (1), Animesh Garg (2), Allan D. Jepson (1). Affiliations: (1) Samsung AI Centre Toronto; (2) University of Toronto, Vector Institute.
Pseudocode | Yes | Algorithm 1: Subsequence alignment with Drop-DTW. (A hedged Python sketch of this alignment appears after the table.)
Open Source Code | Yes | Our code is available at: https://github.com/SamsungLabs/Drop-DTW.
Open Datasets | Yes | Synthetic dataset. We use the MNIST dataset [55] to generate videos of moving digits (cf. [56]). Datasets. For evaluation, we use the following three recent instructional video datasets: CrossTask [9], COIN [10], and YouCook2 [11]. We train the same network used in previous work [15, 1] using the alignment proxy task on Penn Action [12]. Finally, we demonstrate the strengths of Drop-DTW across a range of applications... We adopt their encoder architecture and evaluation protocol; for additional details, please see the supplemental. [13] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, "Audio-visual event localization in unconstrained videos," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
Dataset Splits | No | The paper does not explicitly provide numerical train/validation/test dataset splits (e.g., percentages or absolute counts) required for reproduction. It describes how data is used for training and inference, but not the split methodology itself.
Hardware Specification | No | The paper does not specify the hardware used for running experiments, such as specific GPU models, CPU types, or cloud computing instance details.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments (e.g., 'Python 3.8, PyTorch 1.9').
Experiment Setup | Yes | We further train a two-layer fully-connected network on top of the visual embeddings alone to align videos with a list of corresponding step (language) embeddings using the Drop-DTW loss, Eq. (8). To regularize the training, we introduce an additional clustering loss, L_clust, defined in the supplemental. Finally, we use L_DTW with asymmetric costs, Eq. (4), and either a 30th-percentile drop cost, Eq. (5), or the learned variant, Eq. (6), in combination with L_clust during training. For this experiment, we use the symmetric matching costs defined in Eq. (3). Since no training is involved in this experiment, we set the drop costs to a constant, d_x = d_z = 0.3, which we establish through cross-validation. When applying Drop-DTW, we use symmetric match costs, Eq. (3), and 70th-percentile drop costs, Eq. (5). (A sketch of the percentile drop cost also appears after the table.)
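
For concreteness, here is a minimal Python sketch of the kind of subsequence alignment the Pseudocode row refers to. It is not the authors' reference implementation (see the GitHub repository above for that): it simplifies to one-sided dropping, where only frames of the longer sequence may be dropped at a fixed cost, while the paper also supports dropping from both sequences and learned drop costs. The function name drop_dtw_cost and the NumPy formulation are assumptions for illustration.

```python
import numpy as np

def drop_dtw_cost(C, drop_cost):
    """Minimal one-sided Drop-DTW sketch (not the authors' reference code).

    Aligns K ordered steps to N frames, allowing any frame to be dropped
    at a fixed per-frame cost instead of forcing every frame into the
    alignment as classic DTW does.

    C         : (K, N) array of pairwise match costs between steps and frames.
    drop_cost : scalar cost charged for every dropped frame.
    Returns the minimal total alignment cost.
    """
    K, N = C.shape
    # D[j, i]: cost of explaining the first i frames with the first j steps.
    D = np.full((K + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    # With no steps consumed yet, frames can only be dropped.
    D[0, 1:] = drop_cost * np.arange(1, N + 1)
    for j in range(1, K + 1):          # steps
        for i in range(1, N + 1):      # frames
            # Match frame i to step j: either open step j at this frame,
            # or extend step j from the previous frame. Vertical DTW moves
            # are omitted so each frame matches at most one step.
            match = C[j - 1, i - 1] + min(D[j - 1, i - 1], D[j, i - 1])
            # Drop frame i and stay on step j.
            drop = drop_cost + D[j, i - 1]
            D[j, i] = min(match, drop)
    return D[K, N]
```

The drop branch is the key departure from classic DTW: a frame whose best match cost exceeds drop_cost is cheaper to leave out of the alignment, which is how outlier frames get excluded.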
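
The percentile drop cost quoted in the Experiment Setup row admits an equally short sketch. percentile_drop_cost is a hypothetical helper name, and the reading below, setting the drop cost so that roughly the cheapest fraction of candidate matches becomes droppable, follows the excerpt rather than the paper's exact Eq. (5).

```python
import numpy as np

def percentile_drop_cost(C, percentile=30):
    # Drop cost set to the given percentile of all pairwise match costs,
    # a hedged reading of the percentile variant quoted above.
    return np.percentile(C, percentile)

# Hypothetical usage with the drop_dtw_cost sketch above, assuming Z and X
# hold unit-norm step and frame embeddings:
# C = 1.0 - Z @ X.T
# cost = drop_dtw_cost(C, percentile_drop_cost(C, 30))
```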