Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
A Coupled Flow Approach to Imitation Learning
Authors: Gideon Joseph Freund, Elad Sarafian, Sarit Kraus
ICML 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CFIL on the standard Mujoco benchmarks (Todorov et al., 2012), first comparing it to state-of-the-art imitation methods, including Value DICE (Kostrikov et al., 2019) and their optimized implementation of DAC (Kostrikov et al., 2018), along with a customary behavioral cloning (BC) baseline. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Bar-Ilan University, Israel. Correspondence to: Gideon Freund <EMAIL>. |
| Pseudocode | Yes | Our resulting algorithm, Coupled Flow Imitation Learning (CFIL). It is summarized in Algorithm 1 |
| Open Source Code | Yes | Code for reproducibility of CFIL, including a detailed description for reproducing our environment, is available at https://github.com/gfreund123/cfil. |
| Open Datasets | Yes | We use Value DICE's original expert demonstrations, with exception to the Humanoid environment, for which we train our own expert, since they did not originally evaluate on it. We use Value DICE's open-source implementation to comfortably run all three baselines. NDI (Kim et al., 2021b) would be the ideal candidate for comparison, given the similarities, however no code was available. |
| Dataset Splits | No | The paper specifies training details and evaluation metrics (e.g., "evaluating over 10 episodes after each") but does not explicitly mention distinct training/validation/test splits with percentages or counts for a dataset, typical in supervised learning. For RL, evaluation episodes on the environment serve a similar purpose to testing, but a dedicated validation split is not specified. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instance types) used to run the experiments. |
| Software Dependencies | No | The paper mentions software like "Spinning Up's (Achiam, 2018) SAC (Haarnoja et al., 2018)" and the "Adam optimizer (Kingma & Ba, 2014)" but does not provide specific version numbers for these libraries or frameworks. It also refers to an "open-source implementation (Bliznashki, 2019)" for MAF, but this is a citation, not a version number for the software dependency itself. |
| Experiment Setup | Yes | Our density update rate is 10 batches of 100, every 1000 timesteps. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001. For squashing we use σ = 6 tanh(x/15), while the smoothing and regularization coefficients are 0.5 and 1 respectively. For all algorithms, we run 80 epochs, each consisting of 4000 timesteps, evaluating over 10 episodes after each. We do this across 5 random seeds and plot means and standard deviations. |
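
The values quoted in the Experiment Setup row can be collected into a small configuration sketch, which makes the implied training budget easy to check. This is only an illustration and is not taken from the CFIL repository: the class name `CFILSetup`, its field names, and the helper `squash` are hypothetical, and the row does not say which quantity the squashing function is applied to.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CFILSetup:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    density_update_batches: int = 10   # batches per density update
    density_batch_size: int = 100      # samples per batch
    density_update_every: int = 1000   # timesteps between density updates
    learning_rate: float = 1e-3        # Adam learning rate
    smoothing_coef: float = 0.5
    regularization_coef: float = 1.0
    epochs: int = 80
    steps_per_epoch: int = 4000
    eval_episodes: int = 10            # evaluation episodes after each epoch
    num_seeds: int = 5


def squash(x: float) -> float:
    """Squashing function 6 * tanh(x / 15); its output lies in [-6, 6]."""
    return 6.0 * math.tanh(x / 15.0)


setup = CFILSetup()

# Total environment timesteps per run implied by the quoted schedule.
total_timesteps = setup.epochs * setup.steps_per_epoch

# Number of density update rounds per run, assuming one round every
# `density_update_every` timesteps for the whole run.
density_update_rounds = total_timesteps // setup.density_update_every

# Evaluation episodes per run, with one evaluation after each epoch.
total_eval_episodes = setup.epochs * setup.eval_episodes
```

Under these assumptions, a single run covers 320,000 timesteps, and the reported means and standard deviations are taken over 5 such runs, one per random seed.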