STREAMER: Streaming Representation Learning and Event Segmentation in a Hierarchical Manner
Authors: Ramy Mounir, Sujal Vijayaraghavan, Sudeep Sarkar
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of our model on the egocentric EPIC-KITCHENS dataset, specifically focusing on temporal event segmentation. Furthermore, we conduct event retrieval experiments using the learned representations to demonstrate the high quality of our video event representations. |
| Researcher Affiliation | Academia | Ramy Mounir Sujal Vijayaraghavan Sudeep Sarkar Department of Computer Science and Engineering, University of South Florida, Tampa {ramy, sujal, sarkar}@usf.edu |
| Pseudocode | Yes | Algorithm 1 : Hierarchy Level Reduction. Given a list of the highest level annotations AL from the predicted hierarchy and the ground truth annotations G, this algorithm finds the optimal match of the predicted annotations across the hierarchy with the ground truth while avoiding any temporal overlap between events. |
| Open Source Code | No | Illustration videos and code are available on our project page: https://ramymounir.com/publications/streamer. This is a project page, not a direct link to a source code repository. |
| Open Datasets | Yes | In our training and evaluation, we use two large-scale egocentric datasets: Ego4D [12] and EPIC-KITCHENS 100 [20]. |
| Dataset Splits | No | We train our model in a self-supervised layer-by-layer manner... on a random 20% subset of Ego4D and 80% of EPIC-KITCHENS, then evaluate on the remaining 20% of EPIC-KITCHENS. Protocol 1 divides EPIC-KITCHENS such that the 20% test split comes from kitchens that have not been seen in the training set, whereas Protocol 2 ensures that the kitchens in the test set are also in the training set. The paper specifies train and test splits but does not mention a separate validation split. |
| Hardware Specification | No | The paper mentions architectural components like '4-layer CNN autoencoder' and 'transformer encoder' and that 'eight parallel streams' were trained, but it does not provide specific hardware details such as GPU or CPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' and 'cosine similarity' but does not provide specific version numbers for any software dependencies or frameworks (e.g., Python, PyTorch/TensorFlow, CUDA). |
| Experiment Setup | Yes | We resize video frames to 128×128×3 and use a 4-layer CNN autoencoder... we sample frames at 2 fps... use the Adam optimizer with a constant learning rate of 1e-4 for training... A window size w of 50 inputs... a new layer (l + 1) is added to the stack after layer (l) has processed 50K inputs. |
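The hierarchy level reduction quoted above (Algorithm 1) matches predicted annotations across hierarchy levels to ground-truth events while forbidding temporal overlap among the chosen predictions. The paper's exact matching procedure is not reproduced here; the following is a minimal greedy sketch under assumed conventions: events are `(start, end)` intervals, match quality is temporal IoU, and each ground-truth event receives at most one non-overlapping prediction. The function names `temporal_iou` and `reduce_hierarchy` are hypothetical, not from the paper.

```python
def temporal_iou(a, b):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def reduce_hierarchy(predicted, ground_truth):
    """Greedy stand-in for Algorithm 1: assign predicted events (pooled
    from all hierarchy levels) to ground-truth events by temporal IoU,
    rejecting any prediction that overlaps an already-selected one."""
    # Score every (ground-truth, prediction) pair; try best matches first.
    pairs = sorted(
        ((temporal_iou(p, g), gi, p)
         for gi, g in enumerate(ground_truth) for p in predicted),
        key=lambda t: -t[0],
    )
    chosen = {}  # ground-truth index -> matched predicted interval
    for score, gi, p in pairs:
        if score == 0.0 or gi in chosen:
            continue
        # Enforce the no-temporal-overlap constraint among selections.
        if any(min(p[1], q[1]) > max(p[0], q[0]) for q in chosen.values()):
            continue
        chosen[gi] = p
    return chosen
```

A Hungarian (optimal) assignment with an overlap constraint would be closer to an "optimal match" than this greedy pass, but the greedy version keeps the sketch short and illustrates the constraint itself.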