Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus
Authors: Garrett Tanzer, Biao Zhang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide baselines for sign-to-text tasks using a unified multilingual multitask model based on T5 and report scores on benchmarks across 4 sign languages. The results demonstrate that multilingual transfer benefits both higher- and lower-resource sign languages within YouTube-SL-25. |
| Researcher Affiliation | Industry | Garrett Tanzer Google DeepMind Biao Zhang Google DeepMind Correspondence to EMAIL. |
| Pseudocode | No | The paper includes 'Figure 3: Unified document-level sign-to-text training, extended for multilinguality', which is a diagram illustrating the task format and token structure, not a block of pseudocode or a clearly labeled algorithm. |
| Open Source Code | No | We release the YouTube-SL-25 video IDs under CC BY 4.0 at this link. Note that this license only applies to the video IDs and ISO 639-3 language codes, which we selected and labelled. The underlying video and caption content, as with all datasets consisting of YouTube video IDs, is subject to different licenses and should be accessed/used in accordance with the YouTube Terms of Service. |
| Open Datasets | Yes | We release the YouTube-SL-25 video IDs under CC BY 4.0 at this link. We publicly release the YouTube-SL-25 video IDs at this link. |
| Dataset Splits | Yes | For translation, the model is separately finetuned for each dataset, with the checkpoint selected based on BLEU on the validation set. For sign language identification, zero-shot scores mean that the model is briefly finetuned on YouTube-SL-25 rebalanced to the 4 sign languages with equal weight, and finetuned scores mean that the model is finetuned on an equally weighted mixture of the benchmarks' training sets. We don't finetune on FLEURS-ASL, so the finetuned langid scores are after finetuning on How2Sign. |
| Hardware Specification | Yes | We pretrained our two T5v1.1 Small models (on YouTube-SL-25's ASL and Full sets) on 64 TPUv3s for 210k and 430k steps respectively (switching from pure caption-level training to 1:1 caption-level:random clip-level training once the model appeared to have converged, then stopping again after re-convergence, both according to BLEU on the How2Sign val set, like in FLEURS-ASL [34]). Each 1k steps took about 8 minutes to train. We also pretrained an mT5 Small model for about 600k steps, which was underperforming, so we didn't run the complete set of experiments for it. |
| Software Dependencies | Yes | We train all of our models with Adafactor [30] with base learning rate 0.001. We pretrained our two T5v1.1 Small models (on YouTube-SL-25's ASL and Full sets)... We tried to change the pretrained model from T5-v1.1 Small [26] to mT5 Small [37] so that languages besides English could benefit from pretraining and better tokenization, but in initial experiments mT5 took about 1/3 more steps to converge and achieved worse results. |
| Experiment Setup | Yes | We train all of our models with Adafactor [30] with base learning rate 0.001. We pretrained our two T5v1.1 Small models (on YouTube-SL-25's ASL and Full sets) on 64 TPUv3s for 210k and 430k steps respectively... We finetuned the sentence-level translation models on 16 TPUv3s with a batch size of 32 until convergence; this took about 10k steps for WMT23 SS DSGS and at most 2.5k steps for the other datasets. We finetuned the language identification models on a mixture of data for the four languages with equal weight... We used 16 TPUv3s with a batch size of 32 until convergence, with up to 3k steps. |
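The hardware figures above imply the total pretraining wall-clock time directly: at roughly 8 minutes per 1k steps on 64 TPUv3s, the 210k-step and 430k-step runs work out to about 28 and 57 hours respectively. A minimal sketch of that back-of-the-envelope calculation (the function name is illustrative, not from the paper):

```python
# Estimate pretraining wall-clock time from the paper's reported throughput:
# ~8 minutes per 1,000 steps on 64 TPUv3s.
MINUTES_PER_1K_STEPS = 8


def wall_clock_hours(steps: int) -> float:
    """Estimated wall-clock training time in hours for a given step count."""
    return steps / 1000 * MINUTES_PER_1K_STEPS / 60


for label, steps in [("ASL set, 210k steps", 210_000),
                     ("Full set, 430k steps", 430_000)]:
    print(f"{label}: ~{wall_clock_hours(steps):.0f} hours")
# ASL set, 210k steps: ~28 hours
# Full set, 430k steps: ~57 hours
```

This is only an estimate of the final converged runs; the paper notes training was paused and resumed around the switch from caption-level to mixed caption/clip-level data, so actual elapsed time may differ.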