Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

BirdSet: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics

Authors: Lukas Rauch, Raphael Schwinger, Moritz Wirth, René Heinrich, Denis Huseljic, Marek Herde, Jonas Lange, Stefan Kahl, Bernhard Sick, Sven Tomforde, Christoph Scholz

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark six well-known DL models in multi-label classification across three distinct training scenarios and outline further evaluation use cases in audio classification. We host our dataset on Hugging Face for easy accessibility and offer an extensive codebase to reproduce our results.
Researcher Affiliation | Collaboration | 1 University of Kassel, 2 Kiel University, 3 Fraunhofer IEE, 4 TU Chemnitz
Pseudocode | No | The paper describes methods and procedures in paragraph form and through figures, but no explicit pseudocode blocks or algorithms are presented.
Open Source Code | Yes | An extensive codebase with standardized training and evaluation protocols enables reproducing our results, supporting BirdSet's utility, and easing accessibility for newcomers. https://github.com/DBD-research-group/BirdSet
Open Datasets | Yes | (1) We introduce the BirdSet dataset collection on Hugging Face (HF) (Lhoest et al., 2021), featuring about 520,000 unique global bird sound recordings from nearly 10,000 species with over 6,800 hours for training and over 400 hours of PAM recordings with 170,000 annotated vocalizations across eight unique project sites for evaluation. https://huggingface.co/datasets/DBD-research-group/BirdSet
Dataset Splits | Yes | Training datasets (XCL, XCM) are used to train large-scale models, while soundscape recordings (e.g., PER, NES, UHH) are used for testing. This split reflects practical PAM scenarios, where models are trained on a broad dataset and evaluated on realistic, strongly-labeled soundscapes. Additionally, we provide a validation dataset (POW) for tuning hyperparameters or validating model results.
Hardware Specification | Yes | We primarily utilized an internal Slurm cluster equipped with NVIDIA A100 and V100 GPU servers from the IES group at the University of Kassel, predominantly using the NVIDIA A100 GPU servers for large-scale training. Collaboration with researchers from the University of Kiel and Fraunhofer IEE also involved using NVIDIA A100 GPUs within their internal compute clusters. Additionally, we conducted smaller-scale experiments on a workstation equipped with an NVIDIA RTX 4090 GPU and an AMD Ryzen 9 7950X CPU.
Software Dependencies | Yes | Torch Audiomentations (Jordal et al., 2024) (code, MIT license): We employ Torch Audiomentations for data augmentation, allowing us to enhance the variability and robustness of our training data.
Experiment Setup | Yes | Table 8 provides a detailed overview of the parameters used to generate baselines for all training scenarios. It serves as an addition to the main article.
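The multi-label setup quoted above (soundscape clips can contain several simultaneously vocalizing species, so targets are multi-hot vectors rather than single class indices) can be sketched as follows. This is a minimal illustration only: the species codes and label layout are hypothetical assumptions, not BirdSet's actual schema.

```python
# Minimal sketch of multi-hot target encoding for multi-label bird
# classification. The species codes below are hypothetical placeholders;
# BirdSet's real label vocabulary is far larger (~10,000 species).

SPECIES = ["amecro", "norcar", "blujay", "houspa"]
SPECIES_TO_IDX = {s: i for i, s in enumerate(SPECIES)}

def multi_hot(labels):
    """Encode a list of species codes as a multi-hot target vector."""
    vec = [0.0] * len(SPECIES)
    for label in labels:
        vec[SPECIES_TO_IDX[label]] = 1.0
    return vec

# A soundscape clip with two overlapping vocalizations:
print(multi_hot(["amecro", "blujay"]))  # -> [1.0, 0.0, 1.0, 0.0]
```

Targets of this form pair with an element-wise sigmoid output and a binary cross-entropy loss, the usual choice for multi-label classification, rather than a softmax over classes.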