Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron Chunking
Authors: Kichang Yang, Seonjun Kim, Minjae Kim, Nairan Zhang, Chi Zhang, Youngki Lee
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By aligning sparsification decisions with the underlying storage behavior, Neuron Chunking improves I/O efficiency by up to 4.65× and 5.76× on Jetson Orin Nano and Jetson AGX Orin, respectively. The code is available at https://github.com/snuhcs/vlm-flash. Evaluation on Jetson Orin AGX and Nano using open-source VLMs and standard benchmarks shows consistent improvement in the accuracy-latency trade-off. |
| Researcher Affiliation | Collaboration | Kichang Yang Seoul National University EMAIL Seonjun Kim Seoul National University EMAIL Minjae Kim Seoul National University EMAIL Nairan Zhang Meta EMAIL Chi Zhang Amazon EMAIL Youngki Lee Seoul National University EMAIL |
| Pseudocode | Yes | Algorithm 1 provides a pseudocode of our multi-scale chunk selection method. |
| Open Source Code | Yes | The code is available at https://github.com/snuhcs/vlm-flash. |
| Open Datasets | Yes | For datasets, we use two multiple-choice video QA benchmarks, TempCompass [25] and NExT-QA [47] (randomly sampled 3000 examples), and a video description dataset, VideoDetailCaption [29]. LMMs-Lab. VideoDetailCaption dataset. https://huggingface.co/datasets/lmms-lab/VideoDetailCaption, 2024. Accessed: 2025-05-15. |
| Dataset Splits | Yes | We apply TEAL's [24] profiling-based method to determine layer-wise sparsity levels for both the baseline and our method, using 25 out of 410 videos from the TempCompass dataset (excluded from the main evaluation). In the reordering experiment, 20 videos are used for calibration and 5 for validation (i.e., result visualization). |
| Hardware Specification | Yes | Hardware. All experiments were performed on two different embedded device setups, representing low-end and high-end hardware environments: (1) Jetson Orin Nano (8 GB memory) with SK Hynix Gold P31 SSD (peak sequential read: 3500 MB/s); (2) Jetson Orin AGX (32 GB memory) with Samsung 990 Pro SSD (peak sequential read: 7450 MB/s). |
| Software Dependencies | No | The paper mentions "Experiments use Linux direct I/O [23] with 6-thread thread-pool in C++." and "GPU Sorting. The importance-to-cost scores of all candidate chunks are transferred to GPU memory and sorted in descending order using PyTorch's GPU-accelerated sort." However, specific version numbers for these software components (e.g., PyTorch, CUDA, C++ compiler) are not provided. |
| Experiment Setup | Yes | We measure accuracy under a sparsity level from 0% to 70% in 10% increments. Each latency experiment was repeated 30 times under identical conditions. We report the median latency together with 95% confidence intervals computed by a 10000-sample non-parametric bootstrap (bias-corrected and accelerated, BCa). We enable jetson_clocks and disable swap to ensure stable measurement. Final hyperparameters are selected near the boundary between feasible (green) and infeasible (red) regions, with a small margin for safety. The selected settings are summarized in Table 2. |
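
The latency reporting described under Experiment Setup (median over 30 runs, 95% interval from a 10000-sample BCa bootstrap) can be illustrated with a minimal sketch. This is not the authors' code: the function name `bca_median_ci`, the half-weight handling of ties, and the latency values in the usage lines are all assumptions made for illustration, and only the Python standard library is used.

```python
import random
from statistics import NormalDist, median


def bca_median_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (median, lower, upper) from a BCa bootstrap of the median.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    samples = list(samples)
    rng = random.Random(seed)
    n = len(samples)
    theta_hat = median(samples)

    # Bootstrap distribution of the median (resampling with replacement).
    boot = sorted(median(rng.choices(samples, k=n)) for _ in range(n_resamples))

    # Bias correction z0 from the share of bootstrap medians below the
    # point estimate. Ties get half weight (an assumption that helps with
    # discrete statistics such as the median); p is clamped away from 0 and 1.
    below = sum(b < theta_hat for b in boot)
    equal = sum(b == theta_hat for b in boot)
    p = (below + 0.5 * equal) / n_resamples
    p = min(max(p, 1.0 / (n_resamples + 1)), n_resamples / (n_resamples + 1))
    nd = NormalDist()
    z0 = nd.inv_cdf(p)

    # Acceleration a from jackknife (leave-one-out) medians.
    jack = [median(samples[:i] + samples[i + 1:]) for i in range(n)]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted_quantile(q):
        # Map the nominal quantile q to its bias-corrected, accelerated one.
        z = nd.inv_cdf(q)
        adj = nd.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))
        idx = min(max(int(adj * n_resamples), 0), n_resamples - 1)
        return boot[idx]

    return theta_hat, adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)


# Usage with 30 made-up latency measurements in milliseconds.
latencies_ms = [100 + (i * 7) % 13 for i in range(30)]
med, lo, hi = bca_median_ci(latencies_ms)
print(f"median = {med} ms, 95% BCa CI = [{lo}, {hi}] ms")
```

If SciPy is available, `scipy.stats.bootstrap` with `method='BCa'` is a likely more robust alternative to this hand-rolled version.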