Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus
Authors: Garrett Tanzer, Biao Zhang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide baselines for sign-to-text tasks using a unified multilingual multitask model based on T5 and report scores on benchmarks across 4 sign languages. The results demonstrate that multilingual transfer benefits both higher- and lower-resource sign languages within YouTube-SL-25. |
| Researcher Affiliation | Industry | Garrett Tanzer Google DeepMind Biao Zhang Google DeepMind Correspondence to EMAIL. |
| Pseudocode | No | The paper includes 'Figure 3: Unified document-level sign-to-text training, extended for multilinguality', which is a diagram illustrating the task format and token structure, not a block of pseudocode or a clearly labeled algorithm. |
| Open Source Code | No | We release the YouTube-SL-25 video IDs under CC BY 4.0 at this link. Note that this license only applies to the video IDs and ISO 639-3 language codes, which we selected and labelled. The underlying video and caption content, as with all datasets consisting of YouTube video IDs, is subject to different licenses and should be accessed/used in accordance with the YouTube Terms of Service. |
| Open Datasets | Yes | We release the YouTube-SL-25 video IDs under CC BY 4.0 at this link. We publicly release the YouTube-SL-25 video IDs at this link. |
| Dataset Splits | Yes | For translation, the model is separately finetuned for each dataset, with the checkpoint selected based on BLEU on the validation set. For sign language identification, zero-shot scores mean that the model is briefly finetuned on YouTube-SL-25 rebalanced to the 4 sign languages with equal weight, and finetuned scores mean that the model is finetuned on an equally weighted mixture of the benchmarks' training sets. We don't finetune on FLEURS-ASL, so the finetuned langid scores are after finetuning on How2Sign. |
| Hardware Specification | Yes | We pretrained our two T5v1.1 Small models (on YouTube-SL-25's ASL and Full sets) on 64 TPUv3s for 210k and 430k steps respectively (switching from pure caption-level training to 1:1 caption-level:random clip-level training once the model appeared to have converged, then stopping again after re-convergence, both according to BLEU on the How2Sign val set, like in FLEURS-ASL [34]). Each 1k steps took about 8 minutes to train. We also pretrained an mT5 Small model for about 600k steps, which was underperforming, so we didn't run the complete set of experiments for it. |
| Software Dependencies | Yes | We train all of our models with Adafactor [30] with base learning rate 0.001. We pretrained our two T5v1.1 Small models (on YouTube-SL-25's ASL and Full sets)... We tried to change the pretrained model from T5-v1.1 Small [26] to mT5 Small [37] so that languages besides English could benefit from pretraining and better tokenization, but in initial experiments mT5 took about 1/3 more steps to converge and achieved worse results. |
| Experiment Setup | Yes | We train all of our models with Adafactor [30] with base learning rate 0.001. We pretrained our two T5v1.1 Small models (on YouTube-SL-25's ASL and Full sets) on 64 TPUv3s for 210k and 430k steps respectively... We finetuned the sentence-level translation models on 16 TPUv3s with a batch size of 32 until convergence; this took about 10k steps for WMT23 SS DSGS and at most 2.5k steps for the other datasets. We finetuned the language identification models on a mixture of data for the four languages with equal weight... We used 16 TPUv3s with a batch size of 32 until convergence, with up to 3k steps. |
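The hardware figures above imply the total pretraining wall-clock time directly: at roughly 8 minutes per 1k steps on 64 TPUv3s, the 210k-step and 430k-step runs work out to about 28 and 57 hours respectively. A minimal sketch of that back-of-the-envelope calculation (the function name is illustrative, not from the paper):

```python
# Estimate pretraining wall-clock time from the paper's reported throughput:
# ~8 minutes per 1,000 steps on 64 TPUv3s.
MINUTES_PER_1K_STEPS = 8


def wall_clock_hours(steps: int) -> float:
    """Estimated wall-clock training time in hours for a given step count."""
    return steps / 1000 * MINUTES_PER_1K_STEPS / 60


for label, steps in [("ASL set, 210k steps", 210_000),
                     ("Full set, 430k steps", 430_000)]:
    print(f"{label}: ~{wall_clock_hours(steps):.0f} hours")
# ASL set, 210k steps: ~28 hours
# Full set, 430k steps: ~57 hours
```

This is only an estimate of the final converged runs; the paper notes training was paused and resumed around the switch from caption-level to mixed caption/clip-level data, so actual elapsed time may differ.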