On the Robustness of Text Vectorizers

Authors: Rémi Catellier, Samuel Vaiter, Damien Garreau

ICML 2023

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "These findings are exemplified through a series of numerical examples." (Abstract); Section 4.2, "Experimental validation"; Section 5.3, "Experimental validation"
Researcher Affiliation | Academia | ¹Université Côte d'Azur, CNRS, LJAD, France; ²Inria, France; ³CNRS, France
Pseudocode | No | The paper presents mathematical proofs and formalisms but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | "The code for all experiments of the paper is available at https://github.com/dgarreau/vectorizer-robustness."
Open Datasets | Yes | "We considered movie reviews from the IMDB dataset as documents and the TF-IDF implementation from scikit-learn with L2 normalization." (Section 4.2); "We considered again movie reviews from the IMDB dataset." (Section 5.3)
Dataset Splits | No | The paper mentions training on a subset of the IMDB dataset (10³ reviews) and describes how documents are perturbed for the experiments ("replaced 5 words", then an increasing number of replaced words), but it does not provide specific train/validation/test splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not specify the hardware (CPU, GPU, memory, or cloud instances) used to run the experiments.
Software Dependencies | No | The paper mentions scikit-learn and gensim but does not give version numbers for these dependencies.
Experiment Setup | Yes | "We chose d = 50 as dimension of the embedding. We took ν = 5 as context size parameter." (Section 5.3)
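The TF-IDF setup quoted in the Open Datasets row (Section 4.2 of the paper) can be reproduced with a few lines of scikit-learn. The sketch below is illustrative only: the toy documents stand in for IMDB reviews, and the word-replacement perturbation is a minimal analogue of the paper's robustness experiments, not the authors' actual script.

```python
# Hedged sketch of the quoted setup: TF-IDF with L2 normalization
# via scikit-learn. Toy documents stand in for IMDB reviews.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "a great movie with a great cast",
    "a dull movie with a weak plot",
    "the plot twist was great",
]

# norm="l2" is scikit-learn's default; made explicit to match the paper's setup.
vectorizer = TfidfVectorizer(norm="l2")
X = vectorizer.fit_transform(docs)

# After L2 normalization, every document embedding has unit Euclidean norm.
row_norms = np.sqrt(X.multiply(X).sum(axis=1)).A1
print(np.allclose(row_norms, 1.0))  # True

# A one-word perturbation (in the spirit of the robustness experiments,
# where 5 or more words are replaced) shifts the embedding; the shift is
# measurable directly in Euclidean distance.
perturbed = vectorizer.transform(["a dull movie with a weak cast"])
delta = np.linalg.norm((X[1] - perturbed).toarray())
```

Because the rows are unit-normalized, the Euclidean distance `delta` between the original and perturbed embeddings is bounded and directly comparable across documents, which is the quantity the paper's robustness bounds control.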