OPEL: Optimal Transport Guided ProcedurE Learning
Authors: Sayeed Shafayet Chowdhury, Soumyadeep Chandra, Kaushik Roy
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Specifically, we achieve 22.4% (IoU) and 26.9% (F1) average improvement compared to the current SOTA on the large-scale egocentric benchmark, EgoProceL. Furthermore, for the third-person benchmarks (ProceL and CrossTask), the proposed approach obtains 46.2% (F1) average enhancement over SOTA. |
| Researcher Affiliation | Academia | Sayeed Shafayet Chowdhury, Soumyadeep Chandra, and Kaushik Roy Elmore Family School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 47907, USA |
| Pseudocode | No | The paper describes the proposed methods using mathematical formulations and descriptive text, but it does not include a distinct pseudocode block or algorithm listing. |
| Open Source Code | Yes | Our code is provided as part of the supplementary material. |
| Open Datasets | Yes | For the 3rd-person view, we utilize established benchmark datasets, namely CrossTask [11] and ProceL [3]. To evaluate the effectiveness of our proposed OPEL framework, we apply it to the 1st-person EgoProceL benchmark [1], which contains 62 hours of egocentric video recordings from 130 subjects engaged in 16 tasks. |
| Dataset Splits | No | The paper mentions training on videos and evaluating on benchmark datasets but does not explicitly describe a separate validation split with percentages or sample counts. Table A1 lists hyperparameters, i.e., training parameters rather than validation-split details. |
| Hardware Specification | Yes | We utilize a single Nvidia A40 GPU, but its full RAM is not required. The GPU memory is dependent on batch size (bs). For a bs of 2, a GPU equipped with approximately 12GB of memory is sufficient for our purposes. |
| Software Dependencies | No | The paper mentions using 'ResNet-50 (pretrained on ImageNet)' as the embedder network and the 'Adam' optimizer. However, it does not specify version numbers for these or other software libraries or dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We employ ResNet-50 (pretrained on ImageNet) as the embedder network. ... The feature extraction is conducted from the Conv4c layer, and we subsequently create a stack of 2 context frames along the temporal dimension. Our video frames are resized to 224×224. The aggregated features are processed through two 3D convolutional layers, followed by a 3D global max pooling layer, two fully-connected layers, and a linear projection layer that outputs embeddings of 128 dimensions. All hyper-parameters are listed in Table A1. |
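The embedder head quoted in the Experiment Setup row (two 3D conv layers, 3D global max pooling, two FC layers, and a 128-dim linear projection over stacked Conv4c features) can be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation: the kernel sizes, channel counts, and FC widths are not given in the excerpt, so the values here (3×3×3 kernels, 512/256-unit hidden layers) are assumptions; only the 1024-channel, 14×14 Conv4c feature shape for a 224×224 input and the 128-dim output follow from the text and standard ResNet-50.

```python
# Illustrative sketch of the OPEL embedder head described in the paper excerpt.
# ASSUMPTIONS: kernel sizes, hidden widths, and activations are placeholders;
# the excerpt only specifies the layer types and the 128-dim output.
import torch
import torch.nn as nn


class EmbedderHead(nn.Module):
    def __init__(self, in_channels: int = 1024, embed_dim: int = 128):
        super().__init__()
        # Two 3D convolutional layers over (channels, context_frames, H, W)
        self.conv3d = nn.Sequential(
            nn.Conv3d(in_channels, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # 3D global max pooling collapses the temporal and spatial dims
        self.pool = nn.AdaptiveMaxPool3d(1)
        # Two fully-connected layers (widths assumed)
        self.fc = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 256),
            nn.ReLU(inplace=True),
        )
        # Linear projection to 128-dim embeddings (specified in the paper)
        self.proj = nn.Linear(256, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, context_frames, H, W), e.g. a stack of
        # 2 context frames of ResNet-50 Conv4c features per sampled frame
        x = self.conv3d(x)
        x = self.pool(x).flatten(1)
        x = self.fc(x)
        return self.proj(x)


if __name__ == "__main__":
    # Conv4c features for a 224x224 frame are 1024 x 14 x 14; 2 context frames
    feats = torch.randn(2, 1024, 2, 14, 14)
    emb = EmbedderHead()(feats)
    print(emb.shape)  # torch.Size([2, 128])
```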