Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding
Authors: Zhenyu Yang, Kairui Zhang, Yuhang Hu, Bing Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Weiming Dong, Changsheng Xu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across three benchmarks demonstrate LiveStar's state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness with 18.1% reduced timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. |
| Researcher Affiliation | Collaboration | Zhenyu Yang¹,², Kairui Zhang³, Yuhang Hu¹, Bing Wang¹, Shengsheng Qian¹,², Bin Wen⁴, Fan Yang⁴, Tingting Gao⁴, Weiming Dong¹,², Changsheng Xu¹,²,⁵. ¹Institute of Automation, Chinese Academy of Sciences; ²University of Chinese Academy of Sciences; ³ShanghaiTech University; ⁴Kuaishou Technology; ⁵Peng Cheng Laboratory |
| Pseudocode | Yes | Algorithm 1: Streaming Verification Decoding. Input: video frame stream $\{[\mathrm{Frm}_t]\}_{t=1}^{T}$. Output: dynamically generated caption [Dec]. Initialize [Dec], [Ctx]; initialize reference timestamp $t_i \leftarrow 0$; for each incoming frame $[\mathrm{Frm}_{t_j}]$ do |
| Open Source Code | Yes | Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar. |
| Open Datasets | Yes | Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar. |
| Dataset Splits | Yes | The dataset comprises 20,137 expert-annotated video streams with temporally dense annotations, rigorously split into 19,137 training and 1,000 evaluation instances (200 per task) without overlap. |
| Hardware Specification | Yes | We conducted full fine-tuning of LiveStar on 8 NVIDIA A800 GPUs. |
| Software Dependencies | Yes | Building upon InternVideo2.5 [10], the model consists of a vision encoder (InternViT [1]), an MLP projector, and a large language model (InternLM2.5-7B [73]). For the vision encoder, we employ InternViT, a model pre-trained on a hybrid dataset combining image captioning and OCR-specific data... The extracted frame embeddings are then fed into an MLP projector to generate frame tokens, following the approach used in LLaVA-1.5 [74]. |
| Experiment Setup | Yes | During training, we conducted full fine-tuning of LiveStar using a total of 83K data, which includes the OmniStar training set. We trained the models for 1 epoch with a learning rate of $4 \times 10^{-5}$ using the AdamW optimizer (β1 = 0.9, β2 = 0.999, weight decay = 0.05), a per-device batch size of 1, and gradient accumulation over 4 steps to achieve an effective batch size of 32. We adopted cosine learning rate scheduling with a warmup ratio of 0.03. Input frames were uniformly resized to 448 × 448, with a patch downsampling ratio of 0.5. The vision encoder was frozen during training, while the MLP projector and language model components were fully updated. Each training sequence contains up to 8192 tokens, consisting of interleaved frame and language tokens following the InternVL2.5 conversational template. We optimized the model using the standard autoregressive cross-entropy loss computed over the language tokens, where loss was computed only on assistant response tokens, and inter-frame language segments were excluded via our SCAM strategy. For inference, the tunable scaling factor in SVeD was set to 1.03 by default, the prune window W in peak-end memory compression was set to 40 frames, and the size of the paraphrased caption pool of streaming video-language alignment was set to M=1 by default for better temporal alignment. |
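
The batch-size and learning-rate figures quoted in the Experiment Setup row can be checked with a short sketch. The GPU count, per-device batch size, accumulation steps, peak learning rate, and warmup ratio come from the table above. The linear warmup shape, the decay to zero, the function names, and the step count derived from reading "83K" as 83,000 samples are assumptions made for illustration; they are not taken from the paper or its code.

```python
import math

# Values quoted in the Hardware Specification and Experiment Setup rows.
NUM_GPUS = 8
PER_DEVICE_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 4
PEAK_LR = 4e-5
WARMUP_RATIO = 0.03

# Assumption: "83K data" is read as roughly 83,000 samples for one epoch.
APPROX_NUM_SAMPLES = 83_000


def effective_batch_size(num_gpus, per_device, accum_steps):
    """Samples contributing to one optimizer update across all devices."""
    return num_gpus * per_device * accum_steps


def lr_at_step(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Hypothetical schedule: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


batch = effective_batch_size(NUM_GPUS, PER_DEVICE_BATCH_SIZE, GRAD_ACCUM_STEPS)
approx_total_steps = APPROX_NUM_SAMPLES // batch  # approximate optimizer steps

print("effective batch size:", batch)
print("approximate optimizer steps:", approx_total_steps)
print("learning rate at midpoint:", lr_at_step(approx_total_steps // 2, approx_total_steps))
```

The product 8 × 1 × 4 reproduces the effective batch size of 32 stated in the row, which suggests the reported figure counts all eight GPUs.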