Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Memorisation in Machine Learning: A Survey of Results

Authors: Dmitrii Usynin, Moritz Knolle, Georgios Kaissis

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this work we consider a broad range of previous definitions and perspectives on memorisation in ML, discuss their interplay with model generalisation and the implications of these phenomena on data privacy. We then propose a framework to reason over what memorisation means in the context of ML training under the prism of an individual sample's influence on the model. Moreover, we systematise methods allowing practitioners to detect the occurrence of memorisation or quantify it, and contextualise our findings in a broad range of ML settings. Finally, we discuss memorisation in the context of privacy attacks, differential privacy and adversarial actors. In this paper, we attempt the aforementioned systematisation. Our work is structured as follows:
Researcher Affiliation | Academia | Dmitrii Usynin (EMAIL): Department of Computing, Imperial College London; Institute for AI in Medicine, Technical University of Munich. Moritz Knolle (EMAIL): Institute for AI in Medicine, Technical University of Munich; Konrad Zuse School of Excellence in Reliable AI. Georgios Kaissis (EMAIL): Institute for AI in Medicine, Technical University of Munich; Institute for Machine Learning in Biomedical Imaging, Helmholtz Munich.
Pseudocode | No | The paper describes methods and concepts in prose and refers to mathematical equations and figures to illustrate them, but it does not contain any structured pseudocode or algorithm blocks. For example, methods such as 'Re-training-based methods approximate Eq. (4) directly through simple Monte Carlo sampling' are described textually, without an accompanying algorithm block.
Open Source Code | No | The paper is a survey of existing methods and does not present a novel methodology for which code would be expected. It refers to various methods and their original publications, but does not provide source code of its own for the survey content.
Open Datasets | No | The paper is a survey that discusses datasets used in prior work (e.g., ImageNet in Feldman & Zhang (2020), and large-scale vision and text datasets in Feldman & Zhang (2020) and Zhang et al. (2021a)), but it does not conduct its own experiments or provide access information for any dataset it uses directly.
Dataset Splits | No | The paper is a survey of existing research and does not conduct its own experiments; it therefore defines no training/validation/test dataset splits.
Hardware Specification | No | The paper is a survey and does not conduct original experiments, so it specifies no hardware used for running experiments.
Software Dependencies | No | The paper is a survey and does not present original experimental work requiring specific software dependencies or version numbers.
Experiment Setup | No | The paper is a survey of existing research and does not describe an original experimental setup, hyperparameters, or system-level training settings.
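The 'simple Monte Carlo sampling' quoted in the Pseudocode row refers to re-training-based influence estimation. A minimal sketch of such an estimate, in the spirit of the per-sample memorisation score of Feldman & Zhang (2020), is shown below; the function name and the toy correctness vectors are illustrative assumptions, not code or data from the paper:

```python
def memorisation_estimate(correct_with, correct_without):
    """Monte Carlo estimate of per-sample memorisation: the gap between
    the empirical probability of classifying sample i correctly when i
    was in the training set and when it was held out, each averaged
    over independently re-trained models (0/1 correctness indicators).
    """
    p_in = sum(correct_with) / len(correct_with)          # accuracy with i included
    p_out = sum(correct_without) / len(correct_without)   # accuracy with i held out
    return p_in - p_out

# Toy illustration: a heavily memorised sample is classified correctly
# almost exclusively by models that saw it during training.
with_i = [1, 1, 1, 1, 1]     # correctness over 5 models trained with sample i
without_i = [0, 0, 1, 0, 0]  # correctness over 5 models trained without it
print(memorisation_estimate(with_i, without_i))  # estimate close to 0.8
```

In practice, the two correctness vectors come from many models re-trained on random subsets that respectively include or exclude the sample, which is what makes the approach computationally expensive.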