RETSim: Resilient and Efficient Text Similarity

Authors: Marina Zhang, Owen Skipper Vallis, Aysegul Bumin, Tanay Vakharia, Elie Bursztein

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. (A minimal near-duplicate detection sketch appears after the table.)
Researcher Affiliation | Collaboration | Marina Zhang (Google), Owen Vallis (Google), Aysegul Bumin (University of Florida), Tanay Vakharia (Google), Elie Bursztein (Google)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 1 is a model architecture diagram.
Open Source Code | Yes | RETSim and the W4NT3D benchmark are released under the MIT License at https://github.com/google/unisim.
Open Datasets | Yes | We use the multilingual C4 dataset (mC4) for raw text data and following (Xue et al., 2020)...
Dataset Splits | No | The paper does not provide specific train/validation/test splits (e.g., percentages or sample counts) for the multilingual C4 data used to train RETSim. It mentions evaluation on datasets such as W4NT3D, NEWS-COPY, and CORE, which serve as test sets, but does not detail splits for model training.
Hardware Specification | Yes | MinHash + LSH: CPU, AMD 7950 (32 cores); RETSim (ONNX): CPU, AMD 7950 (32 cores); RETSim (TensorFlow): GPU, NVIDIA RTX 4090; RETSim (TensorFlow): GPU, NVIDIA H100
Software Dependencies | No | The paper mentions software such as TensorFlow and tools such as USearch and Datasketch, but does not provide version numbers for these dependencies (e.g., "TensorFlow GPU" is listed without a version).
Experiment Setup | Yes | We train RETSim using Multi-Similarity Loss (Wang et al., 2019) with α = 4, β = 40, λ = 0.5, and ϵ = 0.1. We train for 1 million steps with batch size = 1024. We use the LAMB optimizer (You et al., 2019) with a max learning rate of 0.001 and cosine decay. Detailed training hyperparameters are reported in Appendix A.1.2. (A hedged sketch of this loss with the stated hyperparameters appears after the table.)
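
To make the near-duplicate retrieval and deduplication task described in the Research Type row concrete, below is a minimal Python sketch of embedding-based deduplication with a cosine-similarity threshold. It is not RETSim's implementation: the embed() function is a placeholder standing in for any text-embedding model, and the 0.9 threshold is illustrative rather than a value from the paper.

```python
# Minimal sketch: embedding-based near-duplicate detection with a
# cosine-similarity threshold. embed() is a placeholder for a real
# text-embedding model (e.g., RETSim); it returns random vectors here,
# so the demo output is not meaningful.
import numpy as np


def embed(texts):
    # Placeholder: one fixed-size vector per input string.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 256)).astype(np.float32)


def find_near_duplicates(texts, threshold=0.9):
    """Return index pairs (i, j) whose cosine similarity >= threshold."""
    vecs = embed(texts)
    # L2-normalize so the dot product equals cosine similarity.
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if sims[i, j] >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    docs = ["the quick brown fox", "the quick brown fox!", "an unrelated sentence"]
    print(find_near_duplicates(docs))
```

At the scale of the paper's deduplication experiments, the brute-force pairwise scan above would be replaced by an approximate-nearest-neighbor index; the paper reports using USearch for retrieval.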
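
The Experiment Setup row lists the Multi-Similarity Loss hyperparameters (α = 4, β = 40, λ = 0.5, ϵ = 0.1). Below is a hedged NumPy sketch of that loss following the formulation of Wang et al. (2019), with the paper's values as defaults. It is an illustrative re-implementation, not RETSim's training code, which reportedly uses TensorFlow with the LAMB optimizer and cosine learning-rate decay.

```python
# Illustrative NumPy sketch of the Multi-Similarity Loss (Wang et al., 2019)
# with the hyperparameters reported in the paper: alpha=4, beta=40,
# lambda=0.5, epsilon=0.1. Not the authors' implementation.
import numpy as np


def multi_similarity_loss(embeddings, labels, alpha=4.0, beta=40.0,
                          lam=0.5, eps=0.1):
    """Compute the Multi-Similarity Loss over one batch.

    embeddings: (batch, dim) array, L2-normalized inside the function.
    labels:     (batch,) integer ids; near-duplicates share an id.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T                      # cosine similarity matrix
    labels = np.asarray(labels)
    total, count = 0.0, 0
    for i in range(len(labels)):
        pos = labels == labels[i]
        pos[i] = False                 # exclude the anchor itself
        neg = labels != labels[i]
        if not pos.any() or not neg.any():
            continue
        # Pair mining with the epsilon margin (Wang et al., 2019).
        hardest_pos = sim[i][pos].min()
        hardest_neg = sim[i][neg].max()
        pos_sims = sim[i][pos & (sim[i] < hardest_neg + eps)]
        neg_sims = sim[i][neg & (sim[i] > hardest_pos - eps)]
        if pos_sims.size == 0 or neg_sims.size == 0:
            continue
        pos_term = np.log1p(np.sum(np.exp(-alpha * (pos_sims - lam)))) / alpha
        neg_term = np.log1p(np.sum(np.exp(beta * (neg_sims - lam)))) / beta
        total += pos_term + neg_term
        count += 1
    return total / max(count, 1)
```

Per the quoted setup, the paper trains with this loss for 1 million steps at batch size 1024, using the LAMB optimizer with a maximum learning rate of 0.001 and cosine decay; those details are taken from the table above, not from this sketch.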