Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Simulation Environment and Reinforcement Learning Method for Waste Reduction
Authors: Sami Jullien, Mozhdeh Ariannezhad, Paul Groth, Maarten de Rijke
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we compare the performance in inventory replenishment simulation of our new baseline against a number of baselines (RQ3), for a variety of scenarios. We want to see whether we can improve overall profit (RQ1), and, if so, if it comes at the cost of generating more waste (RQ2). We train our DQN-family policies (baselines and GTDQN) on a total of 6 000 pseudo-items, for transitions of 5 000 steps. [...] Table 1: Human-normalized profit (higher is better). Results on trajectories of length 2 000, averaged over 3 000 items, for 3 different consumption volatility scenarios. |
| Researcher Affiliation | Academia | Sami Jullien EMAIL AIRLab University of Amsterdam Amsterdam, The Netherlands Mozhdeh Ariannezhad EMAIL AIRLab University of Amsterdam Amsterdam, The Netherlands Paul Groth EMAIL University of Amsterdam Amsterdam, The Netherlands Maarten de Rijke EMAIL University of Amsterdam Amsterdam, The Netherlands |
| Pseudocode | Yes | Algorithm 1: Generalized Lambda Distribution Q-Learning. Require: quantiles {q_1, ..., q_N}, parameter δ. Input: o, a, r, o′, γ ∈ [0, 1]. 1: Λ(o′, a′), ∀a′ ∈ A # Compute distribution parameters; 2: Λ* ← arg max_{a′} μ̂(Λ(o′, a′)) # Compute optimal action (Equation 7); 3: Tq_i ← r + γ α_{Λ*}(q_i), ∀i # Update projection via Equation 4; 4: Optimize via loss function (Equation 6). Output: Σ_{j=1}^{N} E_i[L_δ^{q_j}(Tq_i, α_{Λ(o,a)}(q_j))] |
| Open Source Code | Yes | Our code was implemented in PyTorch (Paszke et al., 2019) and is available on GitHub. https://github.com/samijullien/GTDQN |
| Open Datasets | Yes | Using real-world data of items being currently sold is impossible, as it would contain confidential information (e.g., the cost obtained from the supplier). This is why we fit a copula on the data we sourced from the retailer to be able to generate what we call pseudo-items: tuples that follow the same distribution as our actual item set. [...] We provide the item generation model and its parameters along with our experiments. |
| Dataset Splits | Yes | We train our DQN-family policies (baselines and GTDQN) on a total of 6 000 pseudo-items, for transitions of 5 000 steps. Moreover, we do not train our agents in an average reward framework, as discounting also presents an interest for accounting in supply chain planning (Beamon & Fernandes, 2004). We evaluate the performance of our agents on a total of 30 generations of 100 unseen pseudo-items, for 2 000 steps. |
| Hardware Specification | Yes | We ran our experiments on an RTX A6000 GPU, 16 CPU cores and 128GB RAM. |
| Software Dependencies | No | Our code was implemented in PyTorch (Paszke et al., 2019) and is available on GitHub. (No version given for PyTorch). |
| Experiment Setup | Yes | We train our DQN-family policies (baselines and GTDQN) on a total of 6 000 pseudo-items, for transitions of 5 000 steps. [...] We performed a grid search on DQN for all hyperparameters, and kept those for all models. However, we set the exploration rate at 0.01 for distributional methods, following (Dabney et al., 2018). |
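The pseudocode quoted above (Algorithm 1) forms distributional Bellman targets from a generalized-lambda-distribution quantile function and optimizes a quantile Huber loss. A minimal NumPy sketch of that update step, assuming the FKML parameterization of the GLD and a standard quantile-regression Huber loss; the paper's exact parameterization and loss details may differ, and all function names here are illustrative:

```python
import numpy as np

def gld_quantile(q, lam):
    """FKML generalized-lambda-distribution quantile function
    (assumed parameterization): Q(u) = l1 + ((u^l3 - 1)/l3 - ((1-u)^l4 - 1)/l4) / l2."""
    l1, l2, l3, l4 = lam
    return l1 + ((q**l3 - 1) / l3 - ((1 - q)**l4 - 1) / l4) / l2

def huber(u, delta):
    """Elementwise Huber penalty with threshold delta."""
    return np.where(np.abs(u) <= delta,
                    0.5 * u**2,
                    delta * (np.abs(u) - 0.5 * delta))

def gtdqn_targets(r, gamma, lam_star, quantiles):
    """Line 3 of Algorithm 1: Tq_i = r + gamma * alpha_{Lambda*}(q_i),
    with Lambda* the GLD parameters of the greedy next action."""
    return r + gamma * gld_quantile(quantiles, lam_star)

def quantile_huber_loss(targets, lam_sa, quantiles, delta=1.0):
    """Line 4 / Output of Algorithm 1: quantile-regression Huber loss
    between targets Tq_i and predicted quantiles alpha_{Lambda(o,a)}(q_j)."""
    pred = gld_quantile(quantiles, lam_sa)           # shape (N,)
    td = targets[:, None] - pred[None, :]            # pairwise TD errors, shape (N, N)
    weight = np.abs(quantiles[None, :] - (td < 0))   # |q_j - 1{td < 0}|
    return np.mean(weight * huber(td, delta))

# Illustrative usage with arbitrary GLD parameters
quantiles = np.linspace(0.05, 0.95, 8)
lam_star = (0.0, 1.0, 0.5, 0.5)                      # hypothetical next-state parameters
targets = gtdqn_targets(r=1.0, gamma=0.99, lam_star=lam_star, quantiles=quantiles)
loss = quantile_huber_loss(targets, lam_sa=(0.1, 1.0, 0.5, 0.5), quantiles=quantiles)
```

In a full agent, `lam_star` would come from the action maximizing the estimated mean of the GLD (line 2 of Algorithm 1), and the loss would be minimized by gradient descent on the network producing the four GLD parameters.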