Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Bagged Regularized k-Distances for Anomaly Detection
Authors: Yuchao Cai, Hanfang Yang, Yuheng Ma, Hanyuan Hang
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the practical side, we conduct numerical experiments to illustrate the insensitivity of the parameter selection of our algorithm compared with other state-of-the-art distance-based methods. Furthermore, our method achieves superior performance on real-world datasets with the introduced bagging technique compared to other approaches. [...] Section 5 presents numerical experiments. |
| Researcher Affiliation | Collaboration | Yuchao Cai EMAIL Department of Statistics and Data Science National University of Singapore 117546, Singapore [...] Hanyuan Hang EMAIL Hong Kong Research Institute Contemporary Amperex Technology (Hong Kong) Limited Hong Kong Science Park, New Territories, Hong Kong |
| Pseudocode | Yes | Algorithm 1: Surrogate Risk Minimization (SRM) [...] Algorithm 2: Bagged Regularized k-Distances for Anomaly Detection (BRDAD) |
| Open Source Code | No | The paper does not provide an explicit statement or link to the source code for the methodology described in this paper. |
| Open Datasets | Yes | To provide an extensive experimental evaluation, we use the latest anomaly detection benchmark repository named ADBench established by Han et al. (2022). |
| Dataset Splits | No | The paper mentions categorizing datasets into small, medium, and large based on sample size and sets the number of bagging rounds (B) accordingly. It also states, "In practice, when B is fixed, we randomly divide the data into B subsets, each containing either n/B or n/B + 1 samples." However, it does not provide specific percentages or absolute counts for training, validation, and test splits for the overall experimental evaluation on the ADBench datasets. |
| Hardware Specification | No | The paper discusses computational efficiency and parallel computation but does not specify any particular hardware components (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "the implementation of the Python package PyOD with its default parameters" for comparison methods like k-NN, LOF, and OCSVM, and "the author's implementation" for DTM and PIDForest. However, it does not specify version numbers for Python or any of these packages, which is necessary for reproducibility. |
| Experiment Setup | Yes | (i) BRDAD is our proposed algorithm, with details provided in Algorithm 2. The choice of B depends on the sample size: for n ∈ (0, 10,000], (10,000, 50,000], and (50,000, +∞), we set B = 1, 5, and 10, respectively. [...] (ii) Distance-To-Measure (DTM) (Gu et al., 2019) [...] the number of neighbors k is fixed to be k = 0.03 × sample size. [...] (v) Partial Identification Forest (PIDForest) (Gopalan et al., 2019) [...] with the number of trees T = 50, the number of buckets B = 5, and the depth of trees p = 10 suggested by the authors. |
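
The two setup details quoted in the table (the sample-size rule for B and the random division of the data into B subsets of size n/B or n/B + 1) can be illustrated with a minimal sketch. This is not the authors' code, which the paper does not release; the function names `choose_bagging_rounds` and `random_partition`, the `seed` parameter, and the use of Python's standard `random` module are all assumptions made for illustration.

```python
import random


def choose_bagging_rounds(n: int) -> int:
    """Return B for sample size n, following the thresholds quoted in the table.

    Hypothetical helper: B = 1 for n in (0, 10000], B = 5 for n in
    (10000, 50000], and B = 10 for n above 50000.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if n <= 10_000:
        return 1
    if n <= 50_000:
        return 5
    return 10


def random_partition(n: int, b: int, seed: int = 0) -> list[list[int]]:
    """Randomly divide the indices 0..n-1 into b disjoint subsets.

    Each subset holds either n // b or n // b + 1 indices. The seed is an
    assumption added here so the split can be repeated.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n, b)
    subsets = []
    start = 0
    for i in range(b):
        # The first `extra` subsets absorb the remainder, one index each.
        size = base + (1 if i < extra else 0)
        subsets.append(indices[start:start + size])
        start += size
    return subsets


if __name__ == "__main__":
    n = 23_456
    b = choose_bagging_rounds(n)
    parts = random_partition(n, b)
    print(b, [len(p) for p in parts])
```

Recording the seed alongside B would be one way to make such a split repeatable, which is the kind of detail the "Dataset Splits" row finds missing.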