Video Generation From Text
Authors: Yitong Li, Martin Renqiang Min, Dinghan Shen, David Carlson, Lawrence Carin
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that the proposed framework generates plausible and diverse short-duration smooth videos, while accurately reflecting the input text information. It significantly outperforms baseline models that directly adapt text-to-image generation procedures to produce videos. Performance is evaluated both visually and by adapting the inception score used to evaluate image generation in GANs. |
| Researcher Affiliation | Collaboration | Duke University, Durham, NC, United States, 27708; NEC Laboratories America, Princeton, NJ, United States, 08540. Emails: {yitong.li, dinghan.shen, david.carlson, lcarin}@duke.edu, renqiang@nec-labs.com |
| Pseudocode | No | The paper describes the model components and training process textually and with diagrams, but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to supplemental files for generated movie clips but does not state that source code for the methodology is released or provide a link to a code repository. |
| Open Datasets | Yes | Clean videos from the Kinetics Human Action Video Dataset (Kinetics) (Kay et al. 2017) are additionally used with the steps described above to further expand the dataset. Using the YouTube-8M (Abu-El-Haija et al. 2016) dataset for this process is also feasible, but the Kinetics dataset has cleaner videos than YouTube-8M. |
| Dataset Splits | Yes | In the training process, the whole video dataset is split with ratios 7 : 1 : 2 to create training, validation and test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using Adam as an optimizer and SIFT/RANSAC for preprocessing, but does not provide specific software names with version numbers for libraries or tools used to implement the models or experiments. |
| Experiment Setup | Yes | Each video uses a sampling rate of 25 frames per second. SIFT key points are extracted for each frame, and the RANSAC algorithm determines whether consecutive frames have enough key-point overlap (Lowe 1999). Each video clip is limited to 32 frames at 64×64 resolution. Pixel values are normalized to the range [-1, 1], matching the tanh function used in the network output layer. The final objective function is L = γ₁·L_CVAE + γ₂·L_GAN + γ₃·L_RECONS, where γ₁, γ₂, and γ₃ are scalar weights for each loss term. In the experiments, γ₁ = γ₂ = 1 and γ₃ = 10, making the values of the three terms empirically comparable. |
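
The 7 : 1 : 2 split noted in the Dataset Splits row can be reproduced in a few lines of Python. The sketch below is a hypothetical helper: the paper reports only the ratio, so the shuffling and seeding here are assumptions.

```python
import random

def split_dataset(clip_ids, seed=0):
    """Split video clip identifiers into train/val/test with a 7:1:2 ratio.

    Hypothetical helper: the paper states only the ratio, so the shuffle
    and seed used here are assumptions.
    """
    rng = random.Random(seed)
    ids = list(clip_ids)
    rng.shuffle(ids)
    n_train = int(0.7 * len(ids))
    n_val = int(0.1 * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remaining ~20%
    return train, val, test

# Example usage with made-up clip names:
# train, val, test = split_dataset([f"clip_{i:05d}" for i in range(1000)])
```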
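
The SIFT + RANSAC continuity check described in the Experiment Setup row can be sketched with OpenCV as below. The `frames_overlap` name, the `min_inliers` threshold, and the 0.75 ratio test are assumptions, not values reported in the paper.

```python
import cv2
import numpy as np

def frames_overlap(frame_a, frame_b, min_inliers=20):
    """Return True if two consecutive frames share enough SIFT key points.

    Illustrative only: frames are assumed to be BGR uint8 images (as read
    by cv2.VideoCapture), and min_inliers is an assumed threshold.
    """
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(gray_a, None)
    kp_b, des_b = sift.detectAndCompute(gray_b, None)
    if des_a is None or des_b is None:
        return False

    # Match descriptors and keep the clearly better correspondences (ratio test).
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    good = [m[0] for m in matches
            if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    if len(good) < 4:  # a homography needs at least 4 point pairs
        return False

    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return mask is not None and int(mask.sum()) >= min_inliers
```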
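
Likewise, the pixel normalization and the weighted objective L = γ₁·L_CVAE + γ₂·L_GAN + γ₃·L_RECONS follow directly from the constants in the Experiment Setup row. The helper names below are illustrative and assume the three loss terms have been computed elsewhere.

```python
import numpy as np

# Constants taken from the Experiment Setup row.
NUM_FRAMES = 32                   # each clip is limited to 32 frames
FRAME_SIZE = 64                   # 64x64 resolution
GAMMA_CVAE, GAMMA_GAN, GAMMA_RECONS = 1.0, 1.0, 10.0

def normalize_clip(frames_uint8):
    """Map raw pixel values [0, 255] to [-1, 1] to match a tanh output layer.

    `frames_uint8` is assumed to have shape (NUM_FRAMES, FRAME_SIZE, FRAME_SIZE, 3).
    """
    return frames_uint8.astype(np.float32) / 127.5 - 1.0

def total_loss(l_cvae, l_gan, l_recons):
    """Combine the three scalar loss terms with the paper's reported weights."""
    return GAMMA_CVAE * l_cvae + GAMMA_GAN * l_gan + GAMMA_RECONS * l_recons
```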