On the Robustness of Text Vectorizers

Authors: Rémi Catellier, Samuel Vaiter, Damien Garreau

ICML 2023

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "These findings are exemplified through a series of numerical examples." (Abstract); Section 4.2, "Experimental validation"; Section 5.3, "Experimental validation"
Researcher Affiliation | Academia | ¹Université Côte d'Azur, CNRS, LJAD, France; ²Inria, France; ³CNRS, France
Pseudocode | No | The paper presents mathematical proofs and formalisms but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | "The code for all experiments of the paper is available at https://github.com/dgarreau/vectorizer-robustness."
Open Datasets | Yes | "We considered movie reviews from the IMDB dataset as documents and the TF-IDF implementation from scikit-learn with L2 normalization." (Section 4.2); "We considered again movie reviews from the IMDB dataset." (Section 5.3)
Dataset Splits | No | The paper mentions training on a subset of the IMDB dataset (10³ reviews) and describes how documents are perturbed for the experiments ("replaced 5 words", then an increasing number of replaced words), but it does not provide specific train/validation/test splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not specify the hardware (CPU, GPU, memory, or cloud instances) used to run the experiments.
Software Dependencies | No | The paper mentions scikit-learn and gensim but does not give version numbers for these dependencies.
Experiment Setup | Yes | "We chose d = 50 as dimension of the embedding. We took ν = 5 as context size parameter." (Section 5.3)
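The TF-IDF setup quoted in the Open Datasets row (Section 4.2 of the paper) can be reproduced with a few lines of scikit-learn. The sketch below is illustrative only: the toy documents stand in for IMDB reviews, and the word-replacement perturbation is a minimal analogue of the paper's robustness experiments, not the authors' actual script.

```python
# Hedged sketch of the quoted setup: TF-IDF with L2 normalization
# via scikit-learn. Toy documents stand in for IMDB reviews.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "a great movie with a great cast",
    "a dull movie with a weak plot",
    "the plot twist was great",
]

# norm="l2" is scikit-learn's default; made explicit to match the paper's setup.
vectorizer = TfidfVectorizer(norm="l2")
X = vectorizer.fit_transform(docs)

# After L2 normalization, every document embedding has unit Euclidean norm.
row_norms = np.sqrt(X.multiply(X).sum(axis=1)).A1
print(np.allclose(row_norms, 1.0))  # True

# A one-word perturbation (in the spirit of the robustness experiments,
# where 5 or more words are replaced) shifts the embedding; the shift is
# measurable directly in Euclidean distance.
perturbed = vectorizer.transform(["a dull movie with a weak cast"])
delta = np.linalg.norm((X[1] - perturbed).toarray())
```

Because the rows are unit-normalized, the Euclidean distance `delta` between the original and perturbed embeddings is bounded and directly comparable across documents, which is the quantity the paper's robustness bounds control.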