Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Simulation Environment and Reinforcement Learning Method for Waste Reduction
Authors: Sami Jullien, Mozhdeh Ariannezhad, Paul Groth, Maarten de Rijke
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we compare the performance in inventory replenishment simulation of our new baseline against a number of baselines (RQ3), for a variety of scenarios. We want to see whether we can improve overall profit (RQ1), and, if so, if it comes at the cost of generating more waste (RQ2). We train our DQN-family policies (baselines and GTDQN) on a total of 6 000 pseudo-items, for transitions of 5 000 steps. [...] Table 1: Human-normalized profit (higher is better). Results on trajectories of length 2 000, averaged over 3 000 items, for 3 different consumption volatility scenarios. |
| Researcher Affiliation | Academia | Sami Jullien EMAIL AIRLab University of Amsterdam Amsterdam, The Netherlands Mozhdeh Ariannezhad EMAIL AIRLab University of Amsterdam Amsterdam, The Netherlands Paul Groth EMAIL University of Amsterdam Amsterdam, The Netherlands Maarten de Rijke EMAIL University of Amsterdam Amsterdam, The Netherlands |
| Pseudocode | Yes | Algorithm 1: Generalized Lambda Distribution Q-Learning. Require: quantiles {q_1, ..., q_N}, parameter δ. Input: o, a, r, o′, γ ∈ [0, 1]. 1: Λ(o′, a′), ∀a′ ∈ A # Compute distribution parameters; 2: Λ* ← arg max_{a′} μ̂(Λ(o′, a′)) # Compute optimal action (Equation 7); 3: Tq_i ← r + γ α_{Λ*}(q_i), ∀i # Update projection via Equation 4; 4: Optimize via loss function (Equation 6). Output: Σ_{j=1}^{N} E_i[L_δ^{q_j}(Tq_i, α_{Λ(o,a)}(q_j))] |
| Open Source Code | Yes | Our code was implemented in PyTorch (Paszke et al., 2019) and is available on GitHub. https://github.com/samijullien/GTDQN |
| Open Datasets | Yes | Using real-world data of items being currently sold is impossible, as it would contain confidential information (e.g., the cost obtained from the supplier). This is why we fit a copula on the data we sourced from the retailer to be able to generate what we call pseudo-items: tuples that follow the same distribution as our actual item set. [...] We provide the item generation model and its parameters along with our experiments. |
| Dataset Splits | Yes | We train our DQN-family policies (baselines and GTDQN) on a total of 6 000 pseudo-items, for transitions of 5 000 steps. Moreover, we do not train our agents in an average reward framework, as discounting also presents an interest for accounting in supply chain planning (Beamon & Fernandes, 2004). We evaluate the performance of our agents on a total of 30 generations of 100 unseen pseudo-items, for 2 000 steps. |
| Hardware Specification | Yes | We ran our experiments on an RTX A6000 GPU, 16 CPU cores and 128GB RAM. |
| Software Dependencies | No | Our code was implemented in PyTorch (Paszke et al., 2019) and is available on GitHub. (No version given for PyTorch). |
| Experiment Setup | Yes | We train our DQN-family policies (baselines and GTDQN) on a total of 6 000 pseudo-items, for transitions of 5 000 steps. [...] We performed a grid search on DQN for all hyperparameters, and kept those for all models. However, we set the exploration rate at 0.01 for distributional methods, following (Dabney et al., 2018). |
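The pseudocode quoted above (Algorithm 1) forms distributional Bellman targets from a generalized-lambda-distribution quantile function and optimizes a quantile Huber loss. A minimal NumPy sketch of that update step, assuming the FKML parameterization of the GLD and a standard quantile-regression Huber loss; the paper's exact parameterization and loss details may differ, and all function names here are illustrative:

```python
import numpy as np

def gld_quantile(q, lam):
    """FKML generalized-lambda-distribution quantile function
    (assumed parameterization): Q(u) = l1 + ((u^l3 - 1)/l3 - ((1-u)^l4 - 1)/l4) / l2."""
    l1, l2, l3, l4 = lam
    return l1 + ((q**l3 - 1) / l3 - ((1 - q)**l4 - 1) / l4) / l2

def huber(u, delta):
    """Elementwise Huber penalty with threshold delta."""
    return np.where(np.abs(u) <= delta,
                    0.5 * u**2,
                    delta * (np.abs(u) - 0.5 * delta))

def gtdqn_targets(r, gamma, lam_star, quantiles):
    """Line 3 of Algorithm 1: Tq_i = r + gamma * alpha_{Lambda*}(q_i),
    with Lambda* the GLD parameters of the greedy next action."""
    return r + gamma * gld_quantile(quantiles, lam_star)

def quantile_huber_loss(targets, lam_sa, quantiles, delta=1.0):
    """Line 4 / Output of Algorithm 1: quantile-regression Huber loss
    between targets Tq_i and predicted quantiles alpha_{Lambda(o,a)}(q_j)."""
    pred = gld_quantile(quantiles, lam_sa)           # shape (N,)
    td = targets[:, None] - pred[None, :]            # pairwise TD errors, shape (N, N)
    weight = np.abs(quantiles[None, :] - (td < 0))   # |q_j - 1{td < 0}|
    return np.mean(weight * huber(td, delta))

# Illustrative usage with arbitrary GLD parameters
quantiles = np.linspace(0.05, 0.95, 8)
lam_star = (0.0, 1.0, 0.5, 0.5)                      # hypothetical next-state parameters
targets = gtdqn_targets(r=1.0, gamma=0.99, lam_star=lam_star, quantiles=quantiles)
loss = quantile_huber_loss(targets, lam_sa=(0.1, 1.0, 0.5, 0.5), quantiles=quantiles)
```

In a full agent, `lam_star` would come from the action maximizing the estimated mean of the GLD (line 2 of Algorithm 1), and the loss would be minimized by gradient descent on the network producing the four GLD parameters.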