RETSim: Resilient and Efficient Text Similarity
Authors: Marina Zhang, Owen Skipper Vallis, Aysegul Bumin, Tanay Vakharia, Elie Bursztein
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. |
| Researcher Affiliation | Collaboration | Marina Zhang¹, Owen Vallis¹, Aysegul Bumin*², Tanay Vakharia¹, Elie Bursztein¹ (¹Google, ²University of Florida) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 1 is a model architecture diagram. |
| Open Source Code | Yes | RETSim and the W4NT3D benchmark are released under the MIT License at https://github.com/google/unisim. A usage sketch follows the table. |
| Open Datasets | Yes | We use the multilingual C4 dataset (mC4) for raw text data and following (Xue et al., 2020)... |
| Dataset Splits | No | The paper does not provide specific train/validation/test splits (e.g., percentages or sample counts) for the multilingual C4 dataset used to train RETSim. It mentions evaluation on other datasets such as W4NT3D, NEWS-COPY, and CORE, which serve as test sets, but does not detail splits for model training. |
| Hardware Specification | Yes | MinHash + LSH: CPU, AMD 7950 (32 cores); RETSim (ONNX): CPU, AMD 7950 (32 cores); RETSim (TensorFlow): GPU, RTX 4090; RETSim (TensorFlow): GPU, NVIDIA H100 |
| Software Dependencies | No | The paper mentions software such as TensorFlow and tools such as USearch and Datasketch, but does not provide version numbers for these dependencies (e.g., "TensorFlow GPU" with no version). A Datasketch-based baseline sketch follows the table. |
| Experiment Setup | Yes | We train RETSim using Multi-Similarity Loss (Wang et al., 2019) with α = 4, β = 40, λ = 0.5, and ϵ = 0.1. We train for 1 million steps with batch size = 1024. We use the LAMB optimizer (You et al., 2019) with a max learning rate of 0.001 and cosine decay. Detailed training hyperparameters are reported in Appendix A.1.2. A sketch of this loss appears below the table. |
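For the Open Source Code row, here is a minimal near-duplicate similarity sketch against the released UniSim package. The `TextSim` class and its `similarity`/`match` calls are assumptions based on the repository's README and may not match the current API; https://github.com/google/unisim is authoritative.

```python
# Hypothetical usage of the released UniSim package (pip install unisim).
# TextSim and its methods are assumptions from the repo README, not a
# verified API; consult https://github.com/google/unisim before use.
from unisim import TextSim

text_sim = TextSim()

# Pairwise near-duplicate similarity between two strings.
score = text_sim.similarity("RETSim is a text similarity model.",
                            "RETS1m is a text similarity model!")
print(f"similarity: {score:.3f}")

# Fuzzy matching of queries against a small target corpus.
queries = ["cheap medz onl1ne!!!"]
targets = ["cheap meds online", "quarterly earnings report"]
print(text_sim.match(queries, targets))
```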
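The Software Dependencies row names Datasketch, which implements the MinHash + LSH baseline that RETSim is benchmarked against in the Hardware Specification row. Below is a self-contained sketch of that baseline; the shingle size, `num_perm`, and `threshold` values are illustrative choices, not the paper's settings.

```python
# Illustrative MinHash + LSH near-duplicate lookup with Datasketch.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """MinHash over character 3-gram shingles (shingling choice is illustrative)."""
    m = MinHash(num_perm=num_perm)
    for i in range(len(text) - 2):
        m.update(text[i:i + 3].encode("utf8"))
    return m

docs = {
    "a": "cheap meds online, limited offer",
    "b": "ch3ap meds 0nline, limited offer",  # adversarial variant of "a"
    "c": "quarterly earnings report for Q3",
}

# Index all documents, then query with a near-duplicate of "a".
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for key, text in docs.items():
    lsh.insert(key, minhash_of(text))
print(lsh.query(minhash_of("cheap meds online, limited offers")))
```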
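The Experiment Setup row reports the Multi-Similarity Loss hyperparameters (α = 4, β = 40, λ = 0.5). Here is a minimal TensorFlow sketch of that loss, following the formulation in Wang et al. (2019) but omitting the ϵ-based pair-mining step; the function and argument names are illustrative, not taken from the paper's code.

```python
import tensorflow as tf

def multi_similarity_loss(sim, pos_mask, neg_mask,
                          alpha=4.0, beta=40.0, lam=0.5):
    """Multi-Similarity Loss (Wang et al., 2019), without epsilon pair mining.

    sim:      [B, B] cosine-similarity matrix between batch embeddings.
    pos_mask: [B, B] boolean mask of positive pairs (same class, i != k).
    neg_mask: [B, B] boolean mask of negative pairs.
    """
    zeros = tf.zeros_like(sim)
    # Positive term: penalizes same-class pairs whose similarity is low.
    pos_exp = tf.where(pos_mask, tf.exp(-alpha * (sim - lam)), zeros)
    pos_term = tf.math.log1p(tf.reduce_sum(pos_exp, axis=1)) / alpha
    # Negative term: penalizes different-class pairs whose similarity is high.
    neg_exp = tf.where(neg_mask, tf.exp(beta * (sim - lam)), zeros)
    neg_term = tf.math.log1p(tf.reduce_sum(neg_exp, axis=1)) / beta
    return tf.reduce_mean(pos_term + neg_term)
```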