RETSim: Resilient and Efficient Text Similarity

Authors: Marina Zhang, Owen Skipper Vallis, Aysegul Bumin, Tanay Vakharia, Elie Bursztein

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. (A minimal near-duplicate detection sketch appears after the table.)
Researcher Affiliation | Collaboration | Marina Zhang (Google), Owen Vallis (Google), Aysegul Bumin (University of Florida), Tanay Vakharia (Google), Elie Bursztein (Google)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 1 is a model architecture diagram.
Open Source Code | Yes | RETSim and the W4NT3D benchmark are released under the MIT License at https://github.com/google/unisim.
Open Datasets | Yes | We use the multilingual C4 dataset (mC4) for raw text data and following (Xue et al., 2020)...
Dataset Splits | No | The paper does not provide specific train/validation/test splits (e.g., percentages or sample counts) for the multilingual C4 data used to train RETSim. It mentions evaluation on datasets such as W4NT3D, NEWS-COPY, and CORE, which serve as test sets, but does not detail splits for model training.
Hardware Specification | Yes | MinHash + LSH: CPU, AMD 7950 (32 cores); RETSim (ONNX): CPU, AMD 7950 (32 cores); RETSim (TensorFlow): GPU, NVIDIA RTX 4090; RETSim (TensorFlow): GPU, NVIDIA H100
Software Dependencies | No | The paper mentions software such as TensorFlow and tools such as USearch and Datasketch, but does not provide version numbers for these dependencies (e.g., "TensorFlow GPU" is listed without a version).
Experiment Setup | Yes | We train RETSim using Multi-Similarity Loss (Wang et al., 2019) with α = 4, β = 40, λ = 0.5, and ϵ = 0.1. We train for 1 million steps with batch size = 1024. We use the LAMB optimizer (You et al., 2019) with a max learning rate of 0.001 and cosine decay. Detailed training hyperparameters are reported in Appendix A.1.2. (A hedged sketch of this loss with the stated hyperparameters appears after the table.)
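
To make the near-duplicate retrieval and deduplication task described in the Research Type row concrete, below is a minimal Python sketch of embedding-based deduplication with a cosine-similarity threshold. It is not RETSim's implementation: the embed() function is a placeholder standing in for any text-embedding model, and the 0.9 threshold is illustrative rather than a value from the paper.

```python
# Minimal sketch: embedding-based near-duplicate detection with a
# cosine-similarity threshold. embed() is a placeholder for a real
# text-embedding model (e.g., RETSim); it returns random vectors here,
# so the demo output is not meaningful.
import numpy as np


def embed(texts):
    # Placeholder: one fixed-size vector per input string.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 256)).astype(np.float32)


def find_near_duplicates(texts, threshold=0.9):
    """Return index pairs (i, j) whose cosine similarity >= threshold."""
    vecs = embed(texts)
    # L2-normalize so the dot product equals cosine similarity.
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if sims[i, j] >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    docs = ["the quick brown fox", "the quick brown fox!", "an unrelated sentence"]
    print(find_near_duplicates(docs))
```

At the scale of the paper's deduplication experiments, the brute-force pairwise scan above would be replaced by an approximate-nearest-neighbor index; the paper reports using USearch for retrieval.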
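
The Experiment Setup row lists the Multi-Similarity Loss hyperparameters (α = 4, β = 40, λ = 0.5, ϵ = 0.1). Below is a hedged NumPy sketch of that loss following the formulation of Wang et al. (2019), with the paper's values as defaults. It is an illustrative re-implementation, not RETSim's training code, which reportedly uses TensorFlow with the LAMB optimizer and cosine learning-rate decay.

```python
# Illustrative NumPy sketch of the Multi-Similarity Loss (Wang et al., 2019)
# with the hyperparameters reported in the paper: alpha=4, beta=40,
# lambda=0.5, epsilon=0.1. Not the authors' implementation.
import numpy as np


def multi_similarity_loss(embeddings, labels, alpha=4.0, beta=40.0,
                          lam=0.5, eps=0.1):
    """Compute the Multi-Similarity Loss over one batch.

    embeddings: (batch, dim) array, L2-normalized inside the function.
    labels:     (batch,) integer ids; near-duplicates share an id.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T                      # cosine similarity matrix
    labels = np.asarray(labels)
    total, count = 0.0, 0
    for i in range(len(labels)):
        pos = labels == labels[i]
        pos[i] = False                 # exclude the anchor itself
        neg = labels != labels[i]
        if not pos.any() or not neg.any():
            continue
        # Pair mining with the epsilon margin (Wang et al., 2019).
        hardest_pos = sim[i][pos].min()
        hardest_neg = sim[i][neg].max()
        pos_sims = sim[i][pos & (sim[i] < hardest_neg + eps)]
        neg_sims = sim[i][neg & (sim[i] > hardest_pos - eps)]
        if pos_sims.size == 0 or neg_sims.size == 0:
            continue
        pos_term = np.log1p(np.sum(np.exp(-alpha * (pos_sims - lam)))) / alpha
        neg_term = np.log1p(np.sum(np.exp(beta * (neg_sims - lam)))) / beta
        total += pos_term + neg_term
        count += 1
    return total / max(count, 1)
```

Per the quoted setup, the paper trains with this loss for 1 million steps at batch size 1024, using the LAMB optimizer with a maximum learning rate of 0.001 and cosine decay; those details are taken from the table above, not from this sketch.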