LaSe-E2V: Towards Language-guided Semantic-aware Event-to-Video Reconstruction
Authors: Kanghao Chen, Hangyu Li, Jiazhou Zhou, Zeyu Wang, Lin Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three datasets covering diverse challenging scenarios (e.g., fast motion, low light) demonstrate the superiority of our method. |
| Researcher Affiliation | Academia | Kanghao Chen (1), Hangyu Li (1), Jiazhou Zhou (1), Zeyu Wang (2,1,3), Lin Wang (1,2,3). (1) AI Thrust, (2) CMA Thrust, HKUST(GZ); (3) Dept. of CSE, HKUST. kchen879@connect.hkust-gz.edu.cn, linwang@ust.hk |
| Pseudocode | No | The paper describes the proposed framework and methods using text and figures, but it does not include a formal pseudocode or algorithm block. |
| Open Source Code | No | Project Page: https://vlislab22.github.io/LaSe-E2V/ and We don't provide open access to data and code in the supplemental material. (NeurIPS Paper Checklist, Question 5) |
| Open Datasets | Yes | We train our pipeline using both synthetic and real-world datasets. For synthetic data, following prior arts [50, 34], we generate event and video sequences from the MS-COCO dataset [33] using the v2e [25] event simulator... We evaluate our model on Event Camera Dataset (ECD) [43], Multi Vehicle Stereo Event Camera (MVSEC) dataset [76] and High-Quality Frames (HQF) dataset [57]. (A simplified sketch of the event-simulation principle follows the table.) |
| Dataset Splits | No | We train our pipeline using both synthetic and real-world datasets... We evaluate our model on Event Camera Dataset (ECD) [43], Multi Vehicle Stereo Event Camera (MVSEC) dataset [76] and High-Quality Frames (HQF) dataset [57]. The paper specifies datasets used for training and evaluation, but does not provide explicit train/test/validation *splits* (percentages or sample counts) for any single dataset, nor does it explicitly detail a separate validation set. |
| Hardware Specification | Yes | The model is trained with the proposed loss across all U-Net parameters, with a batch size of 3 and a learning rate of 5e-5 for 150k steps on 8 NVIDIA V100 GPUs. |
| Software Dependencies | No | Based on Stable Diffusion 2.1-base [52], we use a text-guided video diffusion model [51] to initialize our model... The paper mentions specific models and frameworks but does not provide explicit version numbers for software dependencies like PyTorch, Python, etc. |
| Experiment Setup | Yes | For each training video clip, we sample 16 frames and the corresponding event streams, with an interval of 1 ∼ 3 frames. The input size is adapted to 256 × 256. Following previous methods [64, 57], the data augmentation strategies include Gaussian noise, random flipping, and random pause. The value λ is set to 0.01 for all experiments. The model is trained with the proposed loss across all U-Net parameters, with a batch size of 3 and a learning rate of 5e-5 for 150k steps on 8 NVIDIA V100 GPUs. and During training, we randomly drop input text prompts with a probability of 0.1 to enable classifier-free guidance [22]. For the reconstruction of the first clips and the accumulation error of the autoregressive pipeline, we randomly drop the first frame as the condition with 0.4 probability. During inference, we employ the DDIM sampler [56] with 50 steps and classifier-free guidance with a text guidance scale of w = 5 to sample videos. (A hedged configuration sketch of these settings is given below the table.) |
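The "Open Datasets" row quotes the paper's pipeline of simulating event streams from MS-COCO video clips with the v2e simulator. The snippet below is not v2e: it is a deliberately simplified, idealized event-generation rule (emit an event whenever a pixel's log intensity changes by more than a contrast threshold), shown only to illustrate the principle. The function name and the threshold value are assumptions, and real simulators such as v2e additionally model sensor noise and bandwidth effects.

```python
import numpy as np

def simulate_events(frames, timestamps, threshold=0.2, eps=1e-6):
    """Idealized DVS-style event simulation (simplified sketch, not v2e).

    frames:     (T, H, W) float array of grayscale intensities in [0, 1]
    timestamps: (T,) array of frame times in seconds
    returns:    list of events (t, x, y, polarity)
    """
    log_ref = np.log(frames[0] + eps)          # per-pixel reference log intensity
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_i = np.log(frame + eps)
        diff = log_i - log_ref
        # Positive events where brightness increased beyond the threshold.
        ys, xs = np.where(diff >= threshold)
        events += [(t, x, y, +1) for x, y in zip(xs, ys)]
        # Negative events where brightness decreased beyond the threshold.
        ys, xs = np.where(diff <= -threshold)
        events += [(t, x, y, -1) for x, y in zip(xs, ys)]
        # Update the reference only at pixels that fired; a faithful simulator
        # would emit several events per pixel for large changes and
        # interpolate their timestamps between frames.
        fired = np.abs(diff) >= threshold
        log_ref[fired] = log_i[fired]
    return events
```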
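The "Experiment Setup" row quotes the training and inference hyperparameters. The sketch below only restates those quoted values as a configuration plus two hypothetical helpers: a conditioning-dropout rule and the classifier-free-guidance combination. `denoiser`, `text_emb`, `event_feat`, and `first_frame` are placeholder names, not the authors' API, and the guidance formula is the standard CFG form assumed here rather than code from the paper.

```python
import torch

# Hyperparameters quoted from the paper's experiment setup.
BATCH_SIZE     = 3
LEARNING_RATE  = 5e-5
TRAIN_STEPS    = 150_000
P_DROP_TEXT    = 0.1   # drop text prompt to enable classifier-free guidance
P_DROP_FIRST   = 0.4   # drop first-frame condition (autoregressive robustness)
LAMBDA_WEIGHT  = 0.01  # the paper's loss weight lambda (its exact role is defined in the paper)
DDIM_STEPS     = 50
GUIDANCE_SCALE = 5.0   # text guidance scale w

def drop_conditions(text_emb, first_frame):
    """Randomly drop conditioning signals during training (hypothetical helper)."""
    if torch.rand(1).item() < P_DROP_TEXT:
        text_emb = torch.zeros_like(text_emb)        # null text embedding
    if torch.rand(1).item() < P_DROP_FIRST:
        first_frame = torch.zeros_like(first_frame)  # no first-frame condition
    return text_emb, first_frame

@torch.no_grad()
def cfg_denoise(denoiser, x_t, t, text_emb, event_feat):
    """One guided denoising step: eps_uncond + w * (eps_cond - eps_uncond) (assumed CFG form)."""
    eps_cond   = denoiser(x_t, t, text_emb, event_feat)
    eps_uncond = denoiser(x_t, t, torch.zeros_like(text_emb), event_feat)
    return eps_uncond + GUIDANCE_SCALE * (eps_cond - eps_uncond)
```

In such a setup, `drop_conditions` would be applied to each training batch before the denoising loss, and `cfg_denoise` would be called inside a 50-step DDIM sampling loop at inference; both loops are omitted here because the paper does not publish its code.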