RETVec: Resilient and Efficient Text Vectorizer

Authors: Elie Bursztein, Marina Zhang, Owen Vallis, Xinyu Jia, Alexey Kurakin

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility variables, with the assessed result and the supporting LLM response for each:
Research Type: Experimental
LLM Response: In this paper, we evaluate and compare RETVec to state-of-the-art vectorizers and word embeddings on popular model architectures and datasets. These comparisons demonstrate that RETVec leads to competitive, multilingual models that are significantly more resilient to typos and adversarial text attacks. RETVec is available under the Apache 2 license at https://github.com/google-research/retvec.
Researcher Affiliation: Industry
LLM Response: Elie Bursztein (Google, elieb@google.com); Marina Zhang (Google, marinazh@google.com); Owen Vallis (Google, ovallis@google.com); Xinyu Jia (Google, jiaxinyu@google.com); Alexey Kurakin (Google, kurakin@google.com)
Pseudocode: No
LLM Response: The paper does not contain structured pseudocode or algorithm blocks. It describes processes and architectures in text and diagrams, but does not present them in pseudocode form.
Open Source Code: Yes
LLM Response: RETVec is available under the Apache 2 license at https://github.com/google-research/retvec.
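To make the open-source claim concrete, the sketch below shows one way the released vectorizer might be dropped into a Keras text classifier. It assumes the package exposes a RETVecTokenizer layer under retvec.tf with a sequence_length argument, as suggested by the repository; consult the repository README for the exact API.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Assumption: the published package exposes a Keras layer named
    # RETVecTokenizer under retvec.tf that maps raw UTF-8 strings to
    # per-word embeddings; no vocabulary file or external tokenizer is needed.
    from retvec.tf import RETVecTokenizer

    inputs = layers.Input(shape=(1,), dtype=tf.string, name="text")
    x = RETVecTokenizer(sequence_length=128)(inputs)    # (batch, 128, embed_dim)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    outputs = layers.Dense(4, activation="softmax")(x)  # e.g. 4 AG News classes

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])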
Open Datasets: Yes
LLM Response: We evaluate classification performance on four different datasets with drastically different dataset sizes, number of languages, classification tasks, and text lengths, as summarized in Table 2. For example, the MASSIVE [8] intent classification dataset... We used the Multilingual Amazon Reviews [19] dataset for the vectorization speed evaluations... AG News [33], Yelp Reviews (Polarity) [33].
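The evaluation datasets cited above are publicly available; as one hedged example, AG News can be pulled from the TensorFlow Datasets catalog. The catalog name ag_news_subset and its (text, label) supervised format are assumptions made here, not details taken from the paper.

    import tensorflow_datasets as tfds

    # Assumption: the TFDS entry "ag_news_subset" corresponds to the AG News
    # dataset [33] used in the paper; the authors do not state how they load it.
    train_ds, test_ds = tfds.load(
        "ag_news_subset", split=["train", "test"], as_supervised=True
    )

    for text, label in train_ds.take(1):
        print(text.numpy()[:80], int(label))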
Dataset Splits: Yes
LLM Response: We fine-tune all models for 20 epochs on GLUE using 3 different random seeds and report the best result for each dataset... Table 5: Results on GLUE dev sets.
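The GLUE protocol quoted above (20 epochs, 3 random seeds, best dev-set result reported) could be reproduced along the lines of the sketch below; build_model(), train_ds, and dev_ds are hypothetical placeholders rather than names from the paper or its code release.

    import tensorflow as tf

    # Hypothetical reproduction of the "best of 3 seeds on the dev set"
    # selection described in the paper; all identifiers are placeholders.
    best_metric = float("-inf")
    for seed in (0, 1, 2):
        tf.keras.utils.set_random_seed(seed)
        model = build_model()                     # placeholder model factory
        model.fit(train_ds, epochs=20, verbose=0)
        _, metric = model.evaluate(dev_ds, verbose=0)
        best_metric = max(best_metric, metric)
    print("Best dev metric over 3 seeds:", best_metric)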
Hardware Specification: Yes
LLM Response: All the experiments are run on a standard Google Cloud VM with 16 CPU cores and a V100 Nvidia GPU.
Software Dependencies: Yes
LLM Response: All models are implemented in TensorFlow 2.11 and training is conducted on a Google Cloud VM using a single NVIDIA V100 GPU.
Experiment Setup: Yes
LLM Response: The RETVec model is pre-trained on a typo-augmented version of the 157-language fastText words datasets [10]... The model is trained for 500k steps with batch size = 1024, using Adam with max learning rate = 0.001, β1 = 0.9, β2 = 0.999, and cosine decaying the learning rate to 0.0001 during training. Full training hyperparameters can be found in Appendix E... All models are trained with Adam optimizer with β1 = 0.9, β2 = 0.999 and a max learning rate of 5e-4... All models are trained for 100k steps with batch size 256 and cosine learning rate decay to 0.
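The pre-training hyperparameters quoted above map naturally onto a Keras optimizer configuration. The sketch below is one assumption about how that schedule could be expressed with tf.keras.optimizers.schedules.CosineDecay (alpha=0.1 decays the 1e-3 peak learning rate to the stated 1e-4); it is not code from the paper's release.

    import tensorflow as tf

    TRAIN_STEPS = 500_000   # "trained for 500k steps"
    BATCH_SIZE = 1024       # "batch size = 1024"

    # Cosine decay from the max learning rate of 1e-3 down to 1e-4
    # (alpha = 0.1 means the final rate is 10% of the initial rate).
    lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
        initial_learning_rate=1e-3,
        decay_steps=TRAIN_STEPS,
        alpha=0.1,
    )
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=lr_schedule, beta_1=0.9, beta_2=0.999
    )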