STREAMER: Streaming Representation Learning and Event Segmentation in a Hierarchical Manner

Authors: Ramy Mounir, Sujal Vijayaraghavan, Sudeep Sarkar

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of our model on the egocentric EPIC-KITCHENS dataset, specifically focusing on temporal event segmentation. Furthermore, we conduct event retrieval experiments using the learned representations to demonstrate the high quality of our video event representations.
Researcher Affiliation | Academia | Ramy Mounir, Sujal Vijayaraghavan, Sudeep Sarkar; Department of Computer Science and Engineering, University of South Florida, Tampa; {ramy, sujal, sarkar}@usf.edu
Pseudocode | Yes | Algorithm 1: Hierarchy Level Reduction. Given a list of the highest-level annotations A_L from the predicted hierarchy and the ground-truth annotations G, this algorithm finds the optimal match of the predicted annotations across the hierarchy with the ground truth while avoiding any temporal overlap between events. (A hedged sketch of one possible matching procedure appears after this table.)
Open Source Code | No | Illustration videos and code are available on our project page: https://ramymounir.com/publications/streamer. This is a project page, not a direct link to a source code repository.
Open Datasets | Yes | In our training and evaluation, we use two large-scale egocentric datasets: Ego4D [12] and EPIC-KITCHENS 100 [20].
Dataset Splits | No | We train our model in a self-supervised layer-by-layer manner... on a random 20% subset of Ego4D and 80% of EPIC-KITCHENS, then evaluate on the remaining 20% of EPIC-KITCHENS. Protocol 1 divides EPIC-KITCHENS such that the 20% test split comes from kitchens that have not been seen in the training set, whereas Protocol 2 ensures that the kitchens in the test set are also in the training set. The paper specifies train and test splits but does not mention a separate validation split. (An illustrative split sketch follows the table.)
Hardware Specification | No | The paper mentions architectural components such as a '4-layer CNN autoencoder' and a 'transformer encoder', and states that 'eight parallel streams' were trained, but it does not provide specific hardware details such as GPU or CPU models, or cloud computing specifications.
Software Dependencies | No | The paper mentions using the 'Adam optimizer' and 'cosine similarity' but does not provide specific version numbers for any software dependencies or frameworks (e.g., Python, PyTorch/TensorFlow, CUDA).
Experiment Setup | Yes | We resize video frames to 128×128×3 and use a 4-layer CNN autoencoder... we sample frames at 2 fps... use the Adam optimizer with a constant learning rate of 1e-4 for training... A window size w of 50 inputs... a new layer (l + 1) is added to the stack after layer (l) has processed 50K inputs. (A configuration sketch collecting these values follows the table.)
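The paper's Algorithm 1 itself is not reproduced in this report, so the following is only a minimal sketch of one plausible reading of the described procedure: match predicted events, pooled across hierarchy levels, to ground-truth annotations while forbidding temporal overlap among the selected predictions. The helper names (temporal_iou, reduce_hierarchy), the (start, end) event representation, and the greedy IoU-based matching strategy are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of hierarchy level reduction: greedily match predicted
# events (pooled across hierarchy levels) to ground-truth events by temporal
# IoU, keeping only a non-overlapping subset of predictions.

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def overlaps(a, b):
    """True if two (start, end) intervals share any time span."""
    return min(a[1], b[1]) > max(a[0], b[0])

def reduce_hierarchy(hierarchy_events, ground_truth):
    """hierarchy_events: (start, end) events pooled across all levels.
    ground_truth: (start, end) annotations.
    Returns predictions matched to ground truth, with no temporal overlap."""
    # Score every (prediction, ground truth) pair; consider best matches first.
    pairs = sorted(
        ((temporal_iou(p, g), p, g)
         for p in hierarchy_events for g in ground_truth),
        key=lambda t: t[0], reverse=True)
    selected, used_gt = [], set()
    for iou, pred, gt in pairs:
        if iou == 0.0 or id(gt) in used_gt:
            continue
        if any(overlaps(pred, s) for s in selected):
            continue  # enforce no temporal overlap between chosen events
        selected.append(pred)
        used_gt.add(id(gt))
    return selected
```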
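The two evaluation protocols quoted in the Dataset Splits row can be read as two ways of drawing the 80/20 EPIC-KITCHENS split: Protocol 1 holds out whole kitchens, Protocol 2 holds out videos so that test kitchens also appear in training. The sketch below illustrates that reading; the kitchen_id field name and the grouping logic are assumptions for illustration, not the released split files.

```python
import random

def split_epic_kitchens(videos, protocol, test_frac=0.2, seed=0):
    """videos: list of dicts with a 'kitchen_id' key (assumed field name).
    Protocol 1: test kitchens are disjoint from training kitchens.
    Protocol 2: test videos are drawn from kitchens also seen in training."""
    rng = random.Random(seed)
    if protocol == 1:
        # Hold out ~20% of kitchens entirely (unseen environments at test time).
        kitchens = sorted({v["kitchen_id"] for v in videos})
        rng.shuffle(kitchens)
        test_kitchens = set(kitchens[: max(1, int(test_frac * len(kitchens)))])
        test = [v for v in videos if v["kitchen_id"] in test_kitchens]
        train = [v for v in videos if v["kitchen_id"] not in test_kitchens]
    else:
        # Hold out ~20% of videos; with enough videos per kitchen, test
        # kitchens will (with high probability) also appear in training.
        pool = videos[:]
        rng.shuffle(pool)
        n_test = int(test_frac * len(pool))
        test, train = pool[:n_test], pool[n_test:]
    return train, test
```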
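For reference, the hyperparameters reported in the Experiment Setup row can be collected into a single configuration object. Only the values come from the paper; the class and field names below are our own naming, and the number of parallel streams is taken from the quoted 'eight parallel streams' remark.

```python
from dataclasses import dataclass

@dataclass
class StreamerConfig:
    # Values as reported in the paper; field names are illustrative.
    frame_size: tuple = (128, 128, 3)   # resized video frame resolution
    fps: int = 2                        # frame sampling rate
    optimizer: str = "adam"
    learning_rate: float = 1e-4         # constant learning rate
    window_size: int = 50               # window size w, in inputs
    inputs_per_new_layer: int = 50_000  # add layer l+1 after layer l sees 50K inputs
    num_streams: int = 8                # parallel training streams
```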