Sequence-to-Sequence Learning via Shared Latent Representation
Authors: Xu Shen, Xinmei Tian, Jun Xing, Yong Rui, Dacheng Tao
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our SLR model is validated on the YouTube2Text and MSR-VTT datasets, achieving superior performance on the video-to-sentence task and the first sentence-to-video results. |
| Researcher Affiliation | Collaboration | CAS Key Laboratory of Technology in Geo-Spatial Information Processing and Application Systems, University of Science and Technology of China, China; Institute for Creative Technologies, University of Southern California; Lenovo Research; UBTECH Sydney Artificial Intelligence Institute, SIT, FEIT, University of Sydney, Australia |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | Models are tested on the Microsoft Research Video Description Corpus (YouTube2Text) (Guadarrama et al. 2013) and the MSR-VTT dataset (Xu et al. 2016). |
| Dataset Splits | Yes | The YouTube2Text dataset contains 1,970 videos with about 40 English sentences per video. Following previous works, we randomly split 1,200 videos for training, 100 for validation, and 670 for testing, as in (Yao et al. 2015). The MSR-VTT dataset contains 6,513 videos for training, 497 for validation, and 2,990 for testing, with 20 descriptions per video. (A split sketch appears after the table.) |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like LSTMs, VGG, C3D, and Adam optimizer but does not provide specific version numbers for any of these or other software dependencies. |
| Experiment Setup | Yes | We use an initial learning rate of 0.0001 on YouTube2Text and 0.001 on MSR-VTT for the full-model learning stage, and decay the learning rate by a factor of 10 in the partial-model learning stage. The full-model learning stage is trained for 20 epochs on YouTube2Text and 60 epochs on MSR-VTT. The partial-model learning stage is trained for 80 and 40 epochs on YouTube2Text and MSR-VTT, respectively. Finally, we fine-tune the learned model on the specific task (i.e., video-to-sentence) for 20 epochs. We train the model with the Adam optimizer and a mini-batch size of 100. Gradients are clipped to a maximum L2 norm of 35. (A training-loop sketch appears after the table.) |
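
For reference, here is a minimal sketch of the random 1,200/100/670 split quoted above. The `video_ids` list and the fixed seed are illustrative assumptions; the paper uses the actual YouTube2Text clip identifiers and does not report a seed.

```python
import random

# Hypothetical identifiers for the 1,970 YouTube2Text clips; the real
# corpus has its own video IDs.
video_ids = [f"vid{i:04d}" for i in range(1970)]

rng = random.Random(0)  # fixed seed for repeatability; the paper reports none
shuffled = video_ids[:]
rng.shuffle(shuffled)

# 1,200 / 100 / 670 train / validation / test split, as in Yao et al. 2015.
train_ids = shuffled[:1200]
val_ids = shuffled[1200:1300]
test_ids = shuffled[1300:]

assert (len(train_ids), len(val_ids), len(test_ids)) == (1200, 100, 670)
```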
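
And a hedged PyTorch-style sketch of the reported optimization settings (Adam, learning-rate decay by 10 between stages, mini-batch size 100, gradient clipping at L2 norm 35). The paper does not name a framework, and the `nn.Linear` stand-in, dummy data loader, and placeholder loss are assumptions for runnability only, not the SLR architecture or objectives.

```python
import torch
from torch import nn

# Placeholder stand-in for the SLR model; the real encoder/decoder LSTMs
# around a shared latent representation are not reproduced here.
model = nn.Linear(16, 16)

# Dummy mini-batches; the paper uses a mini-batch size of 100.
loader = [torch.randn(100, 16) for _ in range(5)]

# YouTube2Text setting: initial learning rate 1e-4 for the full-model stage
# (1e-3 on MSR-VTT).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_stage(num_epochs):
    for _ in range(num_epochs):
        for batch in loader:
            optimizer.zero_grad()
            loss = model(batch).pow(2).mean()  # placeholder loss
            loss.backward()
            # Gradients clipped to a maximum L2 norm of 35.
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=35.0)
            optimizer.step()

# Stage 1: full-model learning (20 epochs on YouTube2Text, 60 on MSR-VTT).
train_stage(20)

# Stage 2: partial-model learning with the learning rate decayed by 10
# (80 epochs on YouTube2Text, 40 on MSR-VTT).
for group in optimizer.param_groups:
    group["lr"] /= 10
train_stage(80)

# Stage 3: fine-tune on the target task (e.g. video-to-sentence) for 20 epochs.
train_stage(20)
```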