Noise-Robust De-Duplication at Scale
Authors: Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study uses the unique timeliness of historical news wires to create a 27,210-document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a re-rank style approach combining a bi- and cross-encoder (illustrative sketches of the hashing baseline and the neural pipeline follow the table). The neural approaches significantly outperform hashing and N-gram overlap. |
| Researcher Affiliation | Academia | 1Department of Economics, Harvard University; Cambridge, MA, USA. 2Harvard College; Cambridge, MA, USA. 3Department of Economics, University of California Berkeley; Berkeley, CA, USA. 4Department of Economics, Harvard University and NBER; Cambridge, MA, USA. |
| Pseudocode | No | The paper describes the model architectures and methods in prose and with reference to existing work, but it does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The publicly available neural de-duplication models, available at https://github.com/dell-research-harvard/NEWS-COPY, can be applied to novel de-duplication problems. |
| Open Datasets | Yes | This study uses the unique timeliness of historical news wires to create a 27,210-document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The resulting public NEWS-COPY dataset, which contains 27,210 articles comprising 122,876 positive duplicate pairs, aims to encourage further study of robust de-duplication. |
| Dataset Splits | Yes | The 1955 sample is a validation set used to select hyperparameters for both the N-gram and neural methods. The 1930 and 1974 samples are pooled to form the test set and are used only to produce the results shown in this paper. In the full day samples, there are far more negative than positive pairs, as is generally the case in de-duplication problems, whereas the training data contain a more balanced sample. Table 1: This table provides summary statistics from the NEWS-COPY dataset, decomposed into the training sample and the full day evaluation data. |
| Hardware Specification | Yes | We conduct experiments on a 19 GB, 10-million-article corpus, created by applying the same object detection model used to curate NEWS-COPY to millions of front page newspaper page scans. These experiments use a 32-core 3.50 GHz AMD Ryzen Threadripper PRO 3975WX and a single NVIDIA A6000 GPU, a very modest setup for working with large text corpora. |
| Software Dependencies | No | The paper mentions several software components like "S-BERT MPNet model", "FAISS", "Datasketch's MinHash LSH library", "Layout Parser", "Tesseract", "distil-RoBERTa classifier", "AdamW optimizer", "cuDF", "cuGraph", and "SymSpell". However, it does not provide specific version numbers for any of these components, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We contrastively train a symmetric bi-encoder to learn similar representations for near-duplicate articles and dissimilar representations for non-duplicated articles. The learning rate is 2e-5 with 100% warmup and a batch size of 32. We use an AdamW optimizer, and the model is trained for 16 epochs. For the baseline re-ranking model, we choose a bi-encoder threshold of 0.92, optimized using the one-day validation sample. We use RoBERTa-base (Liu et al., 2019) as the base language model, with a 2e-5 learning rate and an AdamW optimizer. It is trained for 5 epochs with 20% warmup and a batch size of 32. (Hedged training and clustering sketches follow this table.) |
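
The paper's hashing baseline relies on Datasketch's MinHash LSH library. The sketch below is a minimal, non-authoritative illustration of that style of baseline: it builds word-shingle MinHash signatures and queries an LSH index for candidate duplicate pairs. The shingle size, number of permutations, and Jaccard threshold are illustrative placeholders, not the hyperparameters tuned on the paper's 1955 validation day.

```python
# Hedged sketch: MinHash LSH duplicate detection with the datasketch library.
# Shingle size, num_perm, and the Jaccard threshold are illustrative defaults,
# not the values selected in the paper.
from datasketch import MinHash, MinHashLSH


def minhash_signature(text: str, num_perm: int = 128, shingle_size: int = 3) -> MinHash:
    """Build a MinHash signature from word n-gram shingles of an article."""
    tokens = text.lower().split()
    shingles = {
        " ".join(tokens[i : i + shingle_size])
        for i in range(max(len(tokens) - shingle_size + 1, 1))
    }
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf8"))
    return m


def candidate_duplicate_pairs(articles: dict[str, str], threshold: float = 0.5) -> set:
    """Return article-id pairs whose estimated Jaccard similarity exceeds the threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    signatures = {aid: minhash_signature(text) for aid, text in articles.items()}
    for aid, sig in signatures.items():
        lsh.insert(aid, sig)
    pairs = set()
    for aid, sig in signatures.items():
        for other in lsh.query(sig):
            if other != aid:
                pairs.add(tuple(sorted((aid, other))))
    return pairs
```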
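
The bi-encoder described in the Experiment Setup row could be reproduced roughly along the following lines with the classic (pre-3.0) sentence-transformers training interface. The starting checkpoint name and the choice of OnlineContrastiveLoss are assumptions; the learning rate, batch size, epoch count, warmup, and AdamW optimizer mirror the values reported in the table.

```python
# Hedged sketch: contrastive training of a symmetric bi-encoder with
# sentence-transformers. The checkpoint and loss choice are assumptions;
# lr=2e-5, batch size 32, 16 epochs, AdamW, and 100% warmup follow the table.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Toy training pairs: label 1 = near-duplicate articles, label 0 = distinct articles.
train_examples = [
    InputExample(texts=["wire story version a ...", "wire story version b ..."], label=1),
    InputExample(texts=["wire story version a ...", "unrelated local story ..."], label=0),
]

model = SentenceTransformer("all-mpnet-base-v2")  # assumed S-BERT MPNet starting point
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.OnlineContrastiveLoss(model)

num_epochs = 16
steps_per_epoch = len(train_dataloader)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=steps_per_epoch * num_epochs,  # "100% warmup" as reported
    optimizer_params={"lr": 2e-5},  # AdamW is the sentence-transformers default optimizer
)
```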
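
At inference time, the pipeline embeds articles, links pairs whose bi-encoder similarity exceeds the 0.92 threshold reported above, and groups linked articles into duplicate clusters; the paper scales this with FAISS and cuGraph. The small-scale stand-in below uses exact cosine similarity and networkx connected components instead, and omits the cross-encoder re-ranking step; the checkpoint name is an assumption rather than the released NEWS-COPY model.

```python
# Hedged sketch: cluster near-duplicate articles by thresholding bi-encoder
# cosine similarity at 0.92 and taking connected components. The paper uses
# FAISS and cuGraph at scale; this stand-in uses exact similarity and networkx.
import networkx as nx
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # assumed checkpoint

articles = [
    "wire story as printed in paper one ...",
    "same wire story with OCR noise in paper two ...",
    "an unrelated local article ...",
]
embeddings = model.encode(articles, convert_to_tensor=True, normalize_embeddings=True)
similarities = util.cos_sim(embeddings, embeddings)

graph = nx.Graph()
graph.add_nodes_from(range(len(articles)))
for i in range(len(articles)):
    for j in range(i + 1, len(articles)):
        if similarities[i][j] >= 0.92:  # bi-encoder threshold reported in the table
            graph.add_edge(i, j)

# Each connected component is treated as one cluster of near-duplicate articles.
duplicate_clusters = [sorted(component) for component in nx.connected_components(graph)]
print(duplicate_clusters)
```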