On the Robustness of Text Vectorizers
Authors: Rémi Catellier, Samuel Vaiter, Damien Garreau
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | These findings are exemplified through a series of numerical examples. (Abstract); 4.2. Experimental validation; 5.3. Experimental validation |
| Researcher Affiliation | Academia | 1Université Côte d'Azur, CNRS, LJAD, France; 2Inria, France; 3CNRS, France. |
| Pseudocode | No | The paper presents mathematical proofs and formalisms but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for all experiments of the paper is available at https://github.com/dgarreau/vectorizer-robustness. |
| Open Datasets | Yes | We considered movie reviews from the IMDB dataset as documents and the TF-IDF implementation from scikit-learn with L2 normalization. (Section 4.2); We considered again movie reviews from the IMDB dataset. (Section 5.3) — illustrated in the first sketch after the table. |
| Dataset Splits | No | The paper mentions using a "subset of the IMDB dataset (10^3 reviews)" for training and describes how documents are perturbed for the experiments ("replaced 5 words", "increased the number of replaced words"), but it does not provide specific train/validation/test dataset splits (e.g., percentages or sample counts). A perturbation sketch follows the table. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU, GPU, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "scikit-learn" and "gensim" but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We chose d = 50 as dimension of the embedding. We took ν = 5 as context size parameter. (Section 5.3) — see the embedding sketch after the table. |
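The sketch below illustrates the TF-IDF setup quoted under Open Datasets: scikit-learn's `TfidfVectorizer` with L2 normalization, as in Section 4.2 of the paper. The toy corpus is a placeholder; loading the actual IMDB reviews and the exact preprocessing are assumptions, not part of the paper's quoted setup.

```python
# Hedged sketch: TF-IDF vectorization with L2 normalization (Section 4.2).
# The documents below stand in for the IMDB movie reviews.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "a gripping movie with a strong cast",
    "the plot was predictable and the pacing slow",
    "one of the best films of the year",
]

# norm="l2" is scikit-learn's default and matches the L2 normalization
# mentioned in the paper.
vectorizer = TfidfVectorizer(norm="l2")
X = vectorizer.fit_transform(documents)

print(X.shape)  # (n_documents, vocabulary_size)
```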
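The Dataset Splits row quotes the perturbation protocol ("replaced 5 words", "increased the number of replaced words"). A minimal sketch of such a perturbation is given below, assuming the replaced positions and substitute words are drawn uniformly at random from a small vocabulary; the paper's exact replacement rule may differ.

```python
# Hedged sketch: perturb a document by replacing a fixed number of words.
import random

def replace_words(tokens, vocabulary, n_replaced=5, seed=0):
    """Return a copy of `tokens` with `n_replaced` positions swapped for
    random vocabulary words (uniform sampling is an assumption)."""
    rng = random.Random(seed)
    perturbed = list(tokens)
    positions = rng.sample(range(len(perturbed)), k=min(n_replaced, len(perturbed)))
    for i in positions:
        perturbed[i] = rng.choice(vocabulary)
    return perturbed

document = "the movie was long but the ending made it worth the wait".split()
vocabulary = ["great", "boring", "actor", "scene", "script"]
print(replace_words(document, vocabulary, n_replaced=5))
```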
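Finally, the Experiment Setup row reports d = 50 (embedding dimension) and ν = 5 (context size). The sketch below maps these two values onto a gensim embedding; the choice of `Doc2Vec`, the toy corpus, and every other parameter are assumptions, since the quoted setup only fixes the dimension and the context size.

```python
# Hedged sketch: gensim document embedding with d = 50 and nu = 5 (Section 5.3).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus; the paper uses a subset of the IMDB reviews.
corpus = [
    "a gripping movie with a strong cast",
    "the plot was predictable and the pacing slow",
    "one of the best films of the year",
]
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

model = Doc2Vec(
    tagged,
    vector_size=50,  # d = 50, embedding dimension
    window=5,        # nu = 5, context size
    min_count=1,     # assumption: keep all words in this tiny toy corpus
    epochs=20,       # assumption: not reported in the quoted setup
)

print(model.infer_vector("a strong cast and a slow plot".split()).shape)  # (50,)
```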