RETVec: Resilient and Efficient Text Vectorizer
Authors: Elie Bursztein, Marina Zhang, Owen Vallis, Xinyu Jia, Alexey Kurakin
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we evaluate and compare RETVec to state-of-the-art vectorizers and word embeddings on popular model architectures and datasets. These comparisons demonstrate that RETVec leads to competitive, multilingual models that are significantly more resilient to typos and adversarial text attacks. RETVec is available under the Apache 2 license at https://github.com/google-research/retvec. |
| Researcher Affiliation | Industry | Elie Bursztein (Google) elieb@google.com; Marina Zhang (Google) marinazh@google.com; Owen Vallis (Google) ovallis@google.com; Xinyu Jia (Google) jiaxinyu@google.com; Alexey Kurakin (Google) kurakin@google.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes processes and architectures in text and diagrams, but not in a pseudocode format. |
| Open Source Code | Yes | RETVec is available under the Apache 2 license at https://github.com/google-research/retvec. (Usage sketch below the table.) |
| Open Datasets | Yes | We evaluate classification performance on four different datasets with drastically different dataset sizes, number of languages, classification tasks, and text lengths, as summarized in Table 2. For example, the MASSIVE [8] intent classification dataset... We used the Multilingual Amazon Reviews [19] dataset for the vectorization speed evaluations... AG News [33], Yelp Reviews (Polarity) [33]. (Loading sketch below the table.) |
| Dataset Splits | Yes | We fine-tune all models for 20 epochs on GLUE using 3 different random seeds and report the best result for each dataset... Table 5: Results on GLUE dev sets. |
| Hardware Specification | Yes | All the experiments are run on a standard Google Cloud VM with 16 CPU cores and an NVIDIA V100 GPU. |
| Software Dependencies | Yes | All models are implemented in TensorFlow 2.11 and training is conducted on a Google Cloud VM using a single NVIDIA V100 GPU. |
| Experiment Setup | Yes | The RETVec model is pre-trained on a typo-augmented version of the 157-language fastText word datasets [10]... The model is trained for 500k steps with batch size = 1024, using Adam with max learning rate = 0.001, β₁ = 0.9, β₂ = 0.999, and cosine decaying the learning rate to 0.0001 during training. Full training hyperparameters can be found in Appendix E... All models are trained with Adam optimizer with β₁ = 0.9, β₂ = 0.999 and a max learning rate of 5e-4... All models are trained for 100k steps with batch size 256 and cosine learning rate decay to 0. (Optimizer sketch below the table.) |
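
The "Open Source Code" row points to the released library. As a minimal sketch of how it can be dropped into a Keras model: the `retvec.tf.RETVecTokenizer` import path follows the repository README, while the `sequence_length` value and the downstream layers here are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch: RETVec as a drop-in text-vectorization layer in a Keras
# classifier (pip install retvec). Import path per the repo README; the
# sequence length and downstream layers are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers
from retvec.tf import RETVecTokenizer

inputs = layers.Input(shape=(1,), dtype=tf.string, name="text")
x = RETVecTokenizer(sequence_length=128)(inputs)  # (batch, 128, embed_dim)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(4, activation="softmax")(x)  # e.g. 4 AG News classes
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```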
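
The "Open Datasets" and "Dataset Splits" rows reference public corpora. A loading sketch via TensorFlow Datasets follows; the catalog names `ag_news_subset` and `glue/sst2` are assumptions about which TFDS entries correspond to the AG News data and a GLUE dev split cited above.

```python
# Sketch: loading two of the referenced public datasets with TFDS.
# "ag_news_subset" and "glue/sst2" are assumed TFDS catalog names for the
# AG News and GLUE (SST-2) data; GLUE's "validation" split is its dev set.
import tensorflow_datasets as tfds

ag_train, ag_test = tfds.load("ag_news_subset",
                              split=["train", "test"],
                              as_supervised=True)
sst2_dev = tfds.load("glue/sst2", split="validation")

for text, label in ag_train.take(1):
    print(text.numpy()[:80], label.numpy())
```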
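
Finally, the "Experiment Setup" row reports the pre-training optimizer in enough detail to reconstruct it. A sketch in TensorFlow 2.11 terms, assuming `tf.keras.optimizers.schedules.CosineDecay` matches the paper's "cosine decaying" schedule: setting `alpha=0.1` makes the schedule bottom out at 0.1 × 0.001 = 0.0001, the reported final rate.

```python
# Sketch of the reported pre-training optimizer: Adam (beta_1=0.9,
# beta_2=0.999) with a cosine schedule decaying the learning rate from
# 1e-3 to 1e-4 over 500k steps. alpha sets the schedule floor as a
# fraction of the initial rate: 0.1 * 1e-3 = 1e-4.
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,
    decay_steps=500_000,
    alpha=0.1,
)
optimizer = tf.keras.optimizers.Adam(
    learning_rate=schedule, beta_1=0.9, beta_2=0.999
)
# Batch size 1024; remaining hyperparameters per the paper's Appendix E.
```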