Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Revisiting Data Augmentation for Ultrasound Images

Authors: Adam Tupper, Christian Gagné

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work addresses this gap by analyzing the effectiveness of different augmentation techniques at improving model performance across a wide range of ultrasound image analysis tasks. To achieve this, we introduce a new standardized benchmark of 14 ultrasound image classification and semantic segmentation tasks from 10 different sources and covering 11 body regions. Our results demonstrate that many of the augmentations commonly used for tasks on natural images are also effective on ultrasound images, even more so than augmentations developed specifically for ultrasound images in some cases.
Researcher Affiliation | Academia | Adam Tupper (EMAIL): Institut Intelligence et Données (IID), Université Laval; Mila. Christian Gagné (EMAIL): Institut Intelligence et Données (IID), Université Laval; Canada-CIFAR AI Chair; Mila.
Pseudocode | No | The paper describes the implementation of augmentations using mathematical formulas and textual descriptions (for example, for Depth Attenuation, Haze Artifact Addition, Gaussian Shadow, and Speckle Reduction), but it does not present any explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Our code, documentation and benchmark are available at https://github.com/adamtupper/ultrasound-augmentation.
Open Datasets | Yes | To assess different augmentations for ultrasound image analysis, we created a benchmark of ultrasound image analysis tasks. The benchmark includes 14 tasks (7 classification, 7 segmentation) from 10 public datasets of 2D fan-shaped ultrasound images captured with either convex or phased-array ultrasound probes.
Dataset Splits | Yes | Except for cases where data splits are predefined by the dataset's original authors, we split each dataset into training, validation, and test images using a 7:1:2 split, using patient identifiers where applicable to ensure that there is no patient overlap between the sets.
Hardware Specification | No | The Acknowledgements section mentions support from Calcul Québec and the Digital Research Alliance of Canada, which are High-Performance Computing (HPC) providers. However, it does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for the experiments.
Software Dependencies | No | The paper mentions several software components, such as Optuna (Akiba et al., 2019), EfficientNet-B0 (Tan & Le, 2019), UNet models (Ronneberger et al., 2015), the MONAI library (The MONAI Consortium, 2020), scikit-image (van der Walt et al., 2014), and Albumentations (implied by Table 3). However, specific version numbers for these libraries or frameworks are not provided in the text.
Experiment Setup | Yes | During training, we applied the augmentations with random strength (where applicable) and with 50% probability to each image in an online fashion. ... The images were normalized and resized so that the longest edge measured 224 px and padded (if needed) so that the final image measured 224 × 224 px before applying data augmentation. The only exception was when using random crop, in which case the images were resized to 256 × 256 px before being cropped to 224 × 224 px. ... For each task, we performed 30 training runs using different random seeds for each augmentation. ... Each model was trained for a minimum of 100 epochs. ... We optimized the key regularization hyperparameters (training length in epochs, learning rate, dropout rates, and weight decay values) per task using the Optuna hyperparameter tuning framework. ... The values for the number of epochs were sampled from the set {50, 100, 200}, the learning rate from the log domain between (10^-6, 10^-3), the dropout rates between (0.0, 0.5), and the weight decay from the log domain between (10^-4, 10^-2).
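The patient-aware 7:1:2 split quoted in the Dataset Splits row can be sketched with scikit-learn's GroupShuffleSplit. This is an illustrative reconstruction, not the paper's actual code; the function name `split_by_patient` is hypothetical, and the paper does not state which library it used for splitting.

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(samples, patient_ids, seed=0):
    """Split samples 7:1:2 into train/val/test with no patient overlap.

    `samples` and `patient_ids` are parallel sequences; grouping by
    patient ID keeps all images from one patient in the same subset.
    Hypothetical helper, not taken from the paper's repository.
    """
    # First carve off the test set (20% of patients).
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    trainval_idx, test_idx = next(outer.split(samples, groups=patient_ids))

    # Split the remaining 80% into 70/10: validation is 1/8 of it.
    tv_groups = [patient_ids[i] for i in trainval_idx]
    inner = GroupShuffleSplit(n_splits=1, test_size=0.125, random_state=seed)
    train_rel, val_rel = next(inner.split(trainval_idx, groups=tv_groups))

    train_idx = [int(trainval_idx[i]) for i in train_rel]
    val_idx = [int(trainval_idx[i]) for i in val_rel]
    return train_idx, val_idx, [int(i) for i in test_idx]
```

Note that GroupShuffleSplit's `test_size` is a fraction of patients, not of images, so the realized image-level ratio only approximates 7:1:2 when patients contribute unequal numbers of images.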
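The hyperparameter ranges quoted in the Experiment Setup row map directly onto an Optuna search space. The sketch below encodes only those reported ranges; `train_and_evaluate` is a stub standing in for a full training run, and the trial count is an arbitrary choice, since the paper does not report it here.

```python
import optuna

def train_and_evaluate(epochs, lr, dropout, weight_decay):
    """Stub for a full training run (hypothetical, not the paper's code).

    A real objective would train the task model with these settings and
    return its validation metric.
    """
    return 0.0

def objective(trial):
    # Search space matching the ranges reported in the paper.
    epochs = trial.suggest_categorical("epochs", [50, 100, 200])
    lr = trial.suggest_float("lr", 1e-6, 1e-3, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    weight_decay = trial.suggest_float("weight_decay", 1e-4, 1e-2, log=True)
    return train_and_evaluate(epochs, lr, dropout, weight_decay)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)  # trial count is an assumption
```

`log=True` makes Optuna sample the learning rate and weight decay uniformly in log space, matching the "log domain" phrasing in the quoted setup.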