Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation
Authors: Ryan Wong, Necati Cihan Camgoz, Richard Bowden
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two public benchmark sign language translation datasets, namely RWTH-PHOENIX-Weather 2014T and CSL-Daily, and improve on state-of-the-art gloss-free translation performance with a significant margin. ... 4.3 COMPARISONS WITH STATE-OF-THE-ART METHODS ... 4.5 ABLATION STUDY |
| Researcher Affiliation | Collaboration | Ryan Wong¹, Necati Cihan Camgoz², Richard Bowden¹; ¹University of Surrey, ²Meta Reality Labs |
| Pseudocode | No | The paper describes its methods in text and with diagrams (Figure 1, Figure 2, Figure 3) but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | No | The paper mentions providing 'details of the training settings' and 'details of the libraries we used for the pretrained models in Appendix A.1' for reproducibility, but it does not state that the source code for the methodology described in the paper is openly available or provide a link to it. |
| Open Datasets | Yes | We evaluate our approach on two public benchmark sign language translation datasets, namely RWTH-PHOENIX-Weather 2014T and CSL-Daily. ... RWTH-PHOENIX-WEATHER-2014T (Phoenix14T) (Camgoz et al., 2018) is a German Sign Language dataset... CSL-Daily (Zhou et al., 2021) is a translation dataset... |
| Dataset Splits | No | The paper states 'We conduct our ablation studies on the Phoenix14T dataset, evaluating the BLEU-4 score on the development set.' implying the use of a validation/development set, but it does not specify the exact split percentages or sample counts for training, validation, and test sets, nor does it cite a predefined split. |
| Hardware Specification | Yes | The model is trained end-to-end with a batch size of 8 on two A100 GPUs |
| Software Dependencies | No | Appendix A.1 lists some software components used, such as Dino-V2, XGLM (pretrained weights), SpaCy, and FastText embeddings. However, it does not provide specific version numbers for these libraries, except for a general mention of 'Flash attention v2'. |
| Experiment Setup | Yes | The model is trained end-to-end with a batch size of 8 on two A100 GPUs, subsampling every second frame. ... The sign encoder is a 4 layer transformer with hidden dimension of 512, 8 attention heads and intermediate size of 2048. The temporal downsampling is applied after the 2nd layer. ... We employ the Adam optimizer ... with a learning rate of 3 × 10⁻⁴ and weight decay of 0.001. Training spans 100 epochs with gradient clipping of 1.0 and includes a one-cycle cosine learning rate scheduler ... with warmup for the initial 5 epochs. ... We initialize the prototype (τU) and time temperature (τT) to 0.1. ... We utilize cross-entropy loss with label smoothing set to 0.1 during training. The LoRA rank and alpha values are both set to 4. During inference, we employ a beam search with a width of 4. |
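
The quoted Experiment Setup row can be expressed as a training configuration. The sketch below is a minimal, hypothetical reconstruction in PyTorch, not the authors' released code: `model`, `STEPS_PER_EPOCH`, and the `training_step` helper are placeholders, while the numeric values (learning rate, weight decay, warmup, clipping, label smoothing, LoRA rank/alpha, beam width) come from the quote above.

```python
# Hypothetical sketch of the reported training configuration.
# `model` and STEPS_PER_EPOCH are placeholders; hyperparameter values
# are taken from the paper's quoted experiment setup.
import torch
from torch import nn

model = nn.Linear(512, 512)      # stand-in for the sign encoder + adapted LLM decoder

EPOCHS = 100
WARMUP_EPOCHS = 5
STEPS_PER_EPOCH = 1000           # assumption: depends on dataset size and batch size 8

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,                     # learning rate 3 x 10^-4
    weight_decay=1e-3,           # weight decay 0.001
)

# One-cycle cosine learning rate schedule with warmup over the first 5 of 100 epochs.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-4,
    total_steps=EPOCHS * STEPS_PER_EPOCH,
    pct_start=WARMUP_EPOCHS / EPOCHS,
    anneal_strategy="cos",
)

# Cross-entropy loss with label smoothing of 0.1.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Remaining settings from the quote, kept as plain values since the exact
# adapter and decoding implementations are not specified in the report.
lora_rank, lora_alpha = 4, 4
beam_width = 4

def training_step(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    """One optimization step with gradient-norm clipping at 1.0."""
    optimizer.zero_grad()
    logits = model(inputs)
    loss = criterion(logits, targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()
```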