Drop-DTW: Aligning Common Signal Between Sequences While Dropping Outliers
Authors: Nikita Dvornik, Isma Hadji, Konstantinos G. Derpanis, Animesh Garg, Allan D. Jepson
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we show that Drop-DTW is a robust similarity measure for sequence retrieval and demonstrate its effectiveness as a training loss on diverse applications. With Drop-DTW, we address temporal step localization on instructional videos, representation learning from noisy videos, and cross-modal representation learning for audio-visual retrieval and localization. In all applications, we take a weakly- or unsupervised approach and demonstrate state-of-the-art results under these settings. |
| Researcher Affiliation | Collaboration | Nikita Dvornik (1,2), Isma Hadji (1), Konstantinos G. Derpanis (1), Animesh Garg (2), Allan D. Jepson (1); (1) Samsung AI Centre Toronto, (2) University of Toronto, Vector Institute |
| Pseudocode | Yes | Algorithm 1 Subsequence alignment with Drop-DTW. |
| Open Source Code | Yes | Our code is available at: https://github.com/SamsungLabs/Drop-DTW. |
| Open Datasets | Yes | Synthetic dataset. We use the MNIST dataset [55] to generate videos of moving digits (cf. [56]). Datasets. For evaluation, we use the following three recent instructional video datasets: CrossTask [9], COIN [10], and YouCook2 [11]. We train the same network used in previous work [15, 1] using the alignment proxy task on Penn Action [12]. Finally, we demonstrate the strengths of Drop-DTW across a range of applications... We adopt their encoder architecture and evaluation protocol; for additional details, please see the supplemental. [13] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, Audio-visual event localization in unconstrained videos, in Proceedings of the European Conference on Computer Vision (ECCV), 2018. |
| Dataset Splits | No | The paper does not explicitly provide numerical train/validation/test dataset splits (e.g., percentages or absolute counts) required for reproduction. It describes how data is used for training and inference, but not the split methodology itself. |
| Hardware Specification | No | The paper does not specify the hardware used for running experiments, such as specific GPU models, CPU types, or cloud computing instance details. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments (e.g., 'Python 3.8, PyTorch 1.9'). |
| Experiment Setup | Yes | We further train a two-layer fully-connected network on top of the visual embeddings alone to align videos with a list of corresponding step (language) embeddings using the Drop-DTW loss, (8). To regularize the training, we introduce an additional clustering loss, L_clust, defined in the supplemental. Finally, we use L_DTW with asymmetric costs, (4), and either a 30%-percentile drop cost, (5), or the learned variant, (6), in combination with L_clust during training. For this experiment, we use the symmetric matching costs defined in (3). Since no training is involved in this experiment, we set the drop costs to a constant, d_x = d_z = 0.3, which we establish through cross-validation. When applying Drop-DTW, we use symmetric match costs, (3), and 70%-percentile drop costs, (5). (See the illustrative sketch below this table for how a percentile drop cost plugs into a Drop-DTW-style alignment.) |
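
The Pseudocode and Experiment Setup rows above mention Algorithm 1 and percentile-based drop costs only at a high level. The snippet below is a minimal, hedged sketch of the general idea: a simplified DTW-style dynamic program that aligns a short sequence of step embeddings to a longer sequence of frame embeddings while allowing individual frames to be dropped at a cost set to a percentile of the match costs. It is an illustration of the technique, not the paper's exact Algorithm 1 (which also handles symmetric drops on both sequences and a differentiable relaxation used as a training loss); all names here (`percentile_drop_cost`, `drop_dtw_simplified`, `match_costs`) are our own, not taken from the released SamsungLabs/Drop-DTW code.

```python
# Illustrative sketch only: a simplified DTW with frame drops, not the paper's Algorithm 1.
import numpy as np

def percentile_drop_cost(match_costs, drop_percentile=30):
    """Scalar drop cost chosen as a percentile of all match costs,
    in the spirit of the paper's percentile drop cost (eq. 5)."""
    return np.percentile(match_costs, drop_percentile)

def drop_dtw_simplified(match_costs, drop_cost):
    """Align K steps (rows) to N frames (columns), allowing frames to be dropped.

    match_costs: (K, N) array; match_costs[i, j] = cost of assigning frame j to step i.
    drop_cost:   scalar cost paid for every frame left unmatched.
    Returns the minimal total alignment cost.
    """
    K, N = match_costs.shape
    INF = np.inf
    # D[i, j]: min cost of explaining the first j frames with the first i steps,
    # where each of the first i steps is matched to at least one (in-order) frame.
    D = np.full((K + 1, N + 1), INF)
    D[0, 0] = 0.0
    for j in range(1, N + 1):                # no steps used yet: drop every frame so far
        D[0, j] = D[0, j - 1] + drop_cost
    for i in range(1, K + 1):
        for j in range(1, N + 1):
            drop_j = D[i, j - 1] + drop_cost             # frame j is an outlier
            match_j = match_costs[i - 1, j - 1] + min(
                D[i, j - 1],                              # frame j continues step i
                D[i - 1, j - 1],                          # frame j starts step i
            )
            D[i, j] = min(drop_j, match_j)
    return D[K, N]

# Usage example: costs could come from (negated) step/frame embedding similarities.
rng = np.random.default_rng(0)
C = rng.random((3, 10))                                   # 3 steps, 10 frames
print(drop_dtw_simplified(C, percentile_drop_cost(C, 30)))
```

Note how the percentile choice quoted in the Experiment Setup row controls how aggressively outliers are pruned: a 30%-percentile drop cost makes dropping cheap relative to most match costs (more frames treated as outliers), while a 70%-percentile drop cost makes dropping expensive (most frames get matched).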