RETVec: Resilient and Efficient Text Vectorizer

Authors: Elie Bursztein, Marina Zhang, Owen Vallis, Xinyu Jia, Alexey Kurakin

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility variables, with the assessed result and the supporting LLM response for each:
Research Type: Experimental
LLM Response: In this paper, we evaluate and compare RETVec to state-of-the-art vectorizers and word embeddings on popular model architectures and datasets. These comparisons demonstrate that RETVec leads to competitive, multilingual models that are significantly more resilient to typos and adversarial text attacks. RETVec is available under the Apache 2 license at https://github.com/google-research/retvec.
Researcher Affiliation: Industry
LLM Response: Elie Bursztein (Google, elieb@google.com); Marina Zhang (Google, marinazh@google.com); Owen Vallis (Google, ovallis@google.com); Xinyu Jia (Google, jiaxinyu@google.com); Alexey Kurakin (Google, kurakin@google.com)
Pseudocode: No
LLM Response: The paper does not contain structured pseudocode or algorithm blocks. It describes processes and architectures in text and diagrams, but does not present them in pseudocode form.
Open Source Code: Yes
LLM Response: RETVec is available under the Apache 2 license at https://github.com/google-research/retvec.
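To make the open-source claim concrete, the sketch below shows one way the released vectorizer might be dropped into a Keras text classifier. It assumes the package exposes a RETVecTokenizer layer under retvec.tf with a sequence_length argument, as suggested by the repository; consult the repository README for the exact API.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Assumption: the published package exposes a Keras layer named
    # RETVecTokenizer under retvec.tf that maps raw UTF-8 strings to
    # per-word embeddings; no vocabulary file or external tokenizer is needed.
    from retvec.tf import RETVecTokenizer

    inputs = layers.Input(shape=(1,), dtype=tf.string, name="text")
    x = RETVecTokenizer(sequence_length=128)(inputs)    # (batch, 128, embed_dim)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    outputs = layers.Dense(4, activation="softmax")(x)  # e.g. 4 AG News classes

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])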
Open Datasets: Yes
LLM Response: We evaluate classification performance on four different datasets with drastically different dataset sizes, number of languages, classification tasks, and text lengths, as summarized in Table 2. For example, the MASSIVE [8] intent classification dataset... We used the Multilingual Amazon Reviews [19] dataset for the vectorization speed evaluations... AG News [33], Yelp Reviews (Polarity) [33].
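The evaluation datasets cited above are publicly available; as one hedged example, AG News can be pulled from the TensorFlow Datasets catalog. The catalog name ag_news_subset and its (text, label) supervised format are assumptions made here, not details taken from the paper.

    import tensorflow_datasets as tfds

    # Assumption: the TFDS entry "ag_news_subset" corresponds to the AG News
    # dataset [33] used in the paper; the authors do not state how they load it.
    train_ds, test_ds = tfds.load(
        "ag_news_subset", split=["train", "test"], as_supervised=True
    )

    for text, label in train_ds.take(1):
        print(text.numpy()[:80], int(label))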
Dataset Splits: Yes
LLM Response: We fine-tune all models for 20 epochs on GLUE using 3 different random seeds and report the best result for each dataset... Table 5: Results on GLUE dev sets.
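The GLUE protocol quoted above (20 epochs, 3 random seeds, best dev-set result reported) could be reproduced along the lines of the sketch below; build_model(), train_ds, and dev_ds are hypothetical placeholders rather than names from the paper or its code release.

    import tensorflow as tf

    # Hypothetical reproduction of the "best of 3 seeds on the dev set"
    # selection described in the paper; all identifiers are placeholders.
    best_metric = float("-inf")
    for seed in (0, 1, 2):
        tf.keras.utils.set_random_seed(seed)
        model = build_model()                     # placeholder model factory
        model.fit(train_ds, epochs=20, verbose=0)
        _, metric = model.evaluate(dev_ds, verbose=0)
        best_metric = max(best_metric, metric)
    print("Best dev metric over 3 seeds:", best_metric)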
Hardware Specification: Yes
LLM Response: All the experiments are run on a standard Google Cloud VM with 16 CPU cores and a V100 Nvidia GPU.
Software Dependencies: Yes
LLM Response: All models are implemented in TensorFlow 2.11 and training is conducted on a Google Cloud VM using a single NVIDIA V100 GPU.
Experiment Setup: Yes
LLM Response: The RETVec model is pre-trained on a typo-augmented version of the 157-language fastText words datasets [10]... The model is trained for 500k steps with batch size = 1024, using Adam with max learning rate = 0.001, β1 = 0.9, β2 = 0.999, and cosine decaying the learning rate to 0.0001 during training. Full training hyperparameters can be found in Appendix E... All models are trained with Adam optimizer with β1 = 0.9, β2 = 0.999 and a max learning rate of 5e-4... All models are trained for 100k steps with batch size 256 and cosine learning rate decay to 0.
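The pre-training hyperparameters quoted above map naturally onto a Keras optimizer configuration. The sketch below is one assumption about how that schedule could be expressed with tf.keras.optimizers.schedules.CosineDecay (alpha=0.1 decays the 1e-3 peak learning rate to the stated 1e-4); it is not code from the paper's release.

    import tensorflow as tf

    TRAIN_STEPS = 500_000   # "trained for 500k steps"
    BATCH_SIZE = 1024       # "batch size = 1024"

    # Cosine decay from the max learning rate of 1e-3 down to 1e-4
    # (alpha = 0.1 means the final rate is 10% of the initial rate).
    lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
        initial_learning_rate=1e-3,
        decay_steps=TRAIN_STEPS,
        alpha=0.1,
    )
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=lr_schedule, beta_1=0.9, beta_2=0.999
    )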