LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition
Authors: Zuxuan Wu, Caiming Xiong, Yu-Gang Jiang, Larry S. Davis
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate LiteEval requires substantially less computation while offering excellent classification accuracy for both online and offline predictions. |
| Researcher Affiliation | Collaboration | 1 University of Maryland, 2 Salesforce Research, 3 Fudan University |
| Pseudocode | No | The paper describes the model components and their interactions using text and mathematical equations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository for the described methodology. |
| Open Datasets | Yes | We adopt two large-scale video classification benchmarks to evaluate the performance of LiteEval, i.e., FCVID and ActivityNet. FCVID (Fudan-Columbia Video Dataset) [18] contains 91,223 videos collected from YouTube belonging to 239 classes... ActivityNet [12] consists of videos that are action/activity-oriented... |
| Dataset Splits | Yes | The average duration of videos in FCVID is 167 seconds and the dataset is split into a training set with 45,611 videos and a testing set with 45,612 videos. (...) Here, we use the v1.3 split with a training set of 10,024 videos, a validation set of 4,926 videos and a testing set of 5,044 videos. |
| Hardware Specification | Yes | We implement the framework using PyTorch on one NVIDIA P6000 GPU and adopts Adam [40] as the optimizer... |
| Software Dependencies | No | The paper mentions using 'PyTorch' and the 'Adam' optimizer but does not specify version numbers for its software stack, which are needed for reproducibility. |
| Experiment Setup | Yes | We extract coarse features with a MobileNetV2 [27] model using spatially downsampled video frames (i.e., 112 × 112). ... we use a ResNet-101 model... and set λ to 2. For ActivityNet, we train with a batch size of 128 and the coarse LSTM and the fine LSTM respectively contain 64 and 512 hidden units, while for FCVID, there are 512 and 2,048 hidden units in the coarse and fine LSTM respectively and the batch size is 256. ... adopts Adam [40] as the optimizer with a fixed learning rate of 1e-4 |
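
To make the Experiment Setup row concrete, here is a minimal PyTorch sketch wiring up the FCVID configuration quoted above: a coarse LSTM with 512 hidden units over MobileNetV2 features (1280-d), a fine LSTM with 2,048 hidden units over ResNet-101 features (2048-d), 239 output classes, and Adam with a fixed learning rate of 1e-4. This is not the authors' released code; the class name, the gating module, and the hard-threshold gate are illustrative assumptions (the paper trains the frame-selection gate with a differentiable approximation).

```python
import torch
import torch.nn as nn

class CoarseFineSketch(nn.Module):
    """Hypothetical sketch of the two-LSTM coarse-to-fine setup (FCVID
    configuration), not the authors' implementation."""

    def __init__(self, coarse_dim=1280, fine_dim=2048,
                 coarse_hidden=512, fine_hidden=2048, num_classes=239):
        super().__init__()
        # Coarse LSTM consumes cheap MobileNetV2 features of 112x112 frames.
        self.coarse_lstm = nn.LSTMCell(coarse_dim, coarse_hidden)
        # Fine LSTM consumes expensive ResNet-101 features for selected frames.
        self.fine_lstm = nn.LSTMCell(fine_dim, fine_hidden)
        # Per-frame gate deciding whether to update the fine pathway (stand-in).
        self.gate = nn.Linear(coarse_hidden + fine_hidden, 1)
        self.classifier = nn.Linear(fine_hidden, num_classes)

    def forward(self, coarse_feats, fine_feats):
        # coarse_feats: (T, B, coarse_dim); fine_feats: (T, B, fine_dim)
        T, B, _ = coarse_feats.shape
        hc = coarse_feats.new_zeros(B, self.coarse_lstm.hidden_size)
        cc = torch.zeros_like(hc)
        hf = coarse_feats.new_zeros(B, self.fine_lstm.hidden_size)
        cf = torch.zeros_like(hf)
        for t in range(T):
            hc, cc = self.coarse_lstm(coarse_feats[t], (hc, cc))
            # Hard gate for illustration only; the paper uses a differentiable
            # relaxation so the selection policy can be trained end to end.
            use_fine = (torch.sigmoid(self.gate(torch.cat([hc, hf], 1))) > 0.5).float()
            hf_new, cf_new = self.fine_lstm(fine_feats[t], (hf, cf))
            hf = use_fine * hf_new + (1 - use_fine) * hf
            cf = use_fine * cf_new + (1 - use_fine) * cf
        return self.classifier(hf)

model = CoarseFineSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # fixed LR, per the paper
```

Swapping in the ActivityNet hidden sizes (64 and 512) and batch size 128 reproduces the other quoted configuration.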