LaSe-E2V: Towards Language-guided Semantic-aware Event-to-Video Reconstruction
Authors: Kanghao Chen, Hangyu Li, Jiazhou Zhou, Zeyu Wang, Lin Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three datasets covering diverse challenging scenarios (e.g., fast motion, low light) demonstrate the superiority of our method. |
| Researcher Affiliation | Academia | Kanghao Chen (1), Hangyu Li (1), Jiazhou Zhou (1), Zeyu Wang (2,1,3), Lin Wang (1,2,3). (1) AI Thrust, (2) CMA Thrust, HKUST(GZ); (3) Dept. of CSE, HKUST. kchen879@connect.hkust-gz.edu.cn, linwang@ust.hk |
| Pseudocode | No | The paper describes the proposed framework and methods using text and figures, but it does not include a formal pseudocode or algorithm block. |
| Open Source Code | No | Project Page: https://vlislab22.github.io/LaSe-E2V/ and We don't provide open access to data and code in the supplemental material. (NeurIPS Paper Checklist, Question 5) |
| Open Datasets | Yes | We train our pipeline using both synthetic and real-world datasets. For synthetic data, following prior arts [50, 34], we generate event and video sequences from the MS-COCO dataset [33] using the v2e [25] event simulator... We evaluate our model on Event Camera Dataset (ECD) [43], Multi Vehicle Stereo Event Camera (MVSEC) dataset [76] and High-Quality Frames (HQF) dataset [57]. (A simplified sketch of the event-simulation principle follows the table.) |
| Dataset Splits | No | We train our pipeline using both synthetic and real-world datasets... We evaluate our model on Event Camera Dataset (ECD) [43], Multi Vehicle Stereo Event Camera (MVSEC) dataset [76] and High-Quality Frames (HQF) dataset [57]. The paper specifies datasets used for training and evaluation, but does not provide explicit train/test/validation *splits* (percentages or sample counts) for any single dataset, nor does it explicitly detail a separate validation set. |
| Hardware Specification | Yes | The model is trained with the proposed loss across all U-Net parameters, with a batch size of 3 and a learning rate of 5e-5 for 150k steps on 8 NVIDIA V100 GPUs. |
| Software Dependencies | No | Based on Stable Diffusion 2.1-base [52], we use a text-guided video diffusion model [51] to initialize our model... The paper mentions specific models and frameworks but does not provide explicit version numbers for software dependencies like PyTorch, Python, etc. |
| Experiment Setup | Yes | For each training video clip, we sample 16 frames and the corresponding event streams, with an interval of 1 ∼ 3 frames. The input size is adapted to 256 × 256. Following previous methods [64, 57], the data augmentation strategies include Gaussian noise, random flipping, and random pause. The value λ is set to 0.01 for all experiments. The model is trained with the proposed loss across all U-Net parameters, with a batch size of 3 and a learning rate of 5e-5 for 150k steps on 8 NVIDIA V100 GPUs. and During training, we randomly drop input text prompts with a probability of 0.1 to enable classifier-free guidance [22]. For the reconstruction of the first clips and the accumulation error of the autoregressive pipeline, we randomly drop the first frame as the condition with 0.4 probability. During inference, we employ the DDIM sampler [56] with 50 steps and classifier-free guidance with a text guidance scale of w = 5 to sample videos. (A hedged configuration sketch of these settings is given below the table.) |
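The "Open Datasets" row quotes the paper's pipeline of simulating event streams from MS-COCO video clips with the v2e simulator. The snippet below is not v2e: it is a deliberately simplified, idealized event-generation rule (emit an event whenever a pixel's log intensity changes by more than a contrast threshold), shown only to illustrate the principle. The function name and the threshold value are assumptions, and real simulators such as v2e additionally model sensor noise and bandwidth effects.

```python
import numpy as np

def simulate_events(frames, timestamps, threshold=0.2, eps=1e-6):
    """Idealized DVS-style event simulation (simplified sketch, not v2e).

    frames:     (T, H, W) float array of grayscale intensities in [0, 1]
    timestamps: (T,) array of frame times in seconds
    returns:    list of events (t, x, y, polarity)
    """
    log_ref = np.log(frames[0] + eps)          # per-pixel reference log intensity
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_i = np.log(frame + eps)
        diff = log_i - log_ref
        # Positive events where brightness increased beyond the threshold.
        ys, xs = np.where(diff >= threshold)
        events += [(t, x, y, +1) for x, y in zip(xs, ys)]
        # Negative events where brightness decreased beyond the threshold.
        ys, xs = np.where(diff <= -threshold)
        events += [(t, x, y, -1) for x, y in zip(xs, ys)]
        # Update the reference only at pixels that fired; a faithful simulator
        # would emit several events per pixel for large changes and
        # interpolate their timestamps between frames.
        fired = np.abs(diff) >= threshold
        log_ref[fired] = log_i[fired]
    return events
```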
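The "Experiment Setup" row quotes the training and inference hyperparameters. The sketch below only restates those quoted values as a configuration plus two hypothetical helpers: a conditioning-dropout rule and the classifier-free-guidance combination. `denoiser`, `text_emb`, `event_feat`, and `first_frame` are placeholder names, not the authors' API, and the guidance formula is the standard CFG form assumed here rather than code from the paper.

```python
import torch

# Hyperparameters quoted from the paper's experiment setup.
BATCH_SIZE     = 3
LEARNING_RATE  = 5e-5
TRAIN_STEPS    = 150_000
P_DROP_TEXT    = 0.1   # drop text prompt to enable classifier-free guidance
P_DROP_FIRST   = 0.4   # drop first-frame condition (autoregressive robustness)
LAMBDA_WEIGHT  = 0.01  # the paper's loss weight lambda (its exact role is defined in the paper)
DDIM_STEPS     = 50
GUIDANCE_SCALE = 5.0   # text guidance scale w

def drop_conditions(text_emb, first_frame):
    """Randomly drop conditioning signals during training (hypothetical helper)."""
    if torch.rand(1).item() < P_DROP_TEXT:
        text_emb = torch.zeros_like(text_emb)        # null text embedding
    if torch.rand(1).item() < P_DROP_FIRST:
        first_frame = torch.zeros_like(first_frame)  # no first-frame condition
    return text_emb, first_frame

@torch.no_grad()
def cfg_denoise(denoiser, x_t, t, text_emb, event_feat):
    """One guided denoising step: eps_uncond + w * (eps_cond - eps_uncond) (assumed CFG form)."""
    eps_cond   = denoiser(x_t, t, text_emb, event_feat)
    eps_uncond = denoiser(x_t, t, torch.zeros_like(text_emb), event_feat)
    return eps_uncond + GUIDANCE_SCALE * (eps_cond - eps_uncond)
```

In such a setup, `drop_conditions` would be applied to each training batch before the denoising loss, and `cfg_denoise` would be called inside a 50-step DDIM sampling loop at inference; both loops are omitted here because the paper does not publish its code.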