LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition
Authors: Zuxuan Wu, Caiming Xiong, Yu-Gang Jiang, Larry S. Davis
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate LiteEval requires substantially less computation while offering excellent classification accuracy for both online and offline predictions. |
| Researcher Affiliation | Collaboration | 1 University of Maryland, 2 Salesforce Research, 3 Fudan University |
| Pseudocode | No | The paper describes the model components and their interactions using text and mathematical equations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository for the described methodology. |
| Open Datasets | Yes | We adopt two large-scale video classification benchmarks to evaluate the performance of LiteEval, i.e., FCVID and ActivityNet. FCVID (Fudan-Columbia Video Dataset) [18] contains 91,223 videos collected from YouTube belonging to 239 classes... ActivityNet [12] consists of videos that are action/activity-oriented... |
| Dataset Splits | Yes | The average duration of videos in FCVID is 167 seconds and the dataset is split into a training set with 45,611 videos and a testing set with 45,612 videos. (...) Here, we use the v1.3 split with a training set of 10,024 videos, a validation set of 4,926 videos and a testing set of 5,044 videos. |
| Hardware Specification | Yes | We implement the framework using PyTorch on one NVIDIA P6000 GPU and adopts Adam [40] as the optimizer... |
| Software Dependencies | No | The paper mentions using 'PyTorch' and the 'Adam' optimizer but does not specify version numbers for its software stack, which are needed for reproducibility. |
| Experiment Setup | Yes | We extract coarse features with a MobileNetV2 [27] model using spatially downsampled video frames (i.e., 112 × 112). ... we use a ResNet-101 model... and set λ to 2. For ActivityNet, we train with a batch size of 128 and the coarse LSTM and the fine LSTM respectively contain 64 and 512 hidden units, while for FCVID, there are 512 and 2,048 hidden units in the coarse and fine LSTM respectively and the batch size is 256. ... adopts Adam [40] as the optimizer with a fixed learning rate of 1e-4 |
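
To make the Experiment Setup row concrete, here is a minimal PyTorch sketch wiring up the FCVID configuration quoted above: a coarse LSTM with 512 hidden units over MobileNetV2 features (1280-d), a fine LSTM with 2,048 hidden units over ResNet-101 features (2048-d), 239 output classes, and Adam with a fixed learning rate of 1e-4. This is not the authors' released code; the class name, the gating module, and the hard-threshold gate are illustrative assumptions (the paper trains the frame-selection gate with a differentiable approximation).

```python
import torch
import torch.nn as nn

class CoarseFineSketch(nn.Module):
    """Hypothetical sketch of the two-LSTM coarse-to-fine setup (FCVID
    configuration), not the authors' implementation."""

    def __init__(self, coarse_dim=1280, fine_dim=2048,
                 coarse_hidden=512, fine_hidden=2048, num_classes=239):
        super().__init__()
        # Coarse LSTM consumes cheap MobileNetV2 features of 112x112 frames.
        self.coarse_lstm = nn.LSTMCell(coarse_dim, coarse_hidden)
        # Fine LSTM consumes expensive ResNet-101 features for selected frames.
        self.fine_lstm = nn.LSTMCell(fine_dim, fine_hidden)
        # Per-frame gate deciding whether to update the fine pathway (stand-in).
        self.gate = nn.Linear(coarse_hidden + fine_hidden, 1)
        self.classifier = nn.Linear(fine_hidden, num_classes)

    def forward(self, coarse_feats, fine_feats):
        # coarse_feats: (T, B, coarse_dim); fine_feats: (T, B, fine_dim)
        T, B, _ = coarse_feats.shape
        hc = coarse_feats.new_zeros(B, self.coarse_lstm.hidden_size)
        cc = torch.zeros_like(hc)
        hf = coarse_feats.new_zeros(B, self.fine_lstm.hidden_size)
        cf = torch.zeros_like(hf)
        for t in range(T):
            hc, cc = self.coarse_lstm(coarse_feats[t], (hc, cc))
            # Hard gate for illustration only; the paper uses a differentiable
            # relaxation so the selection policy can be trained end to end.
            use_fine = (torch.sigmoid(self.gate(torch.cat([hc, hf], 1))) > 0.5).float()
            hf_new, cf_new = self.fine_lstm(fine_feats[t], (hf, cf))
            hf = use_fine * hf_new + (1 - use_fine) * hf
            cf = use_fine * cf_new + (1 - use_fine) * cf
        return self.classifier(hf)

model = CoarseFineSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # fixed LR, per the paper
```

Swapping in the ActivityNet hidden sizes (64 and 512) and batch size 128 reproduces the other quoted configuration.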