Coordinated Double Machine Learning

Authors: Nitai Fingerhut, Matteo Sesia, Yaniv Romano

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The improved empirical performance of the proposed method is demonstrated through numerical experiments on both simulated and real data.
Researcher Affiliation | Academia | (1) Departments of Electrical and Computer Engineering and of Computer Science, Technion, Israel; (2) Department of Data Sciences and Operations, University of Southern California, CA, USA.
Pseudocode | Yes | Algorithm 1 (DML), Algorithm 2 (C-DML), Algorithm 3 (C-DML with fixed I1, I2, α, β, γ); a sketch of the basic DML step appears below the table.
Open Source Code | Yes | A Python implementation of the methods described in this paper is available from https://github.com/nitaifingerhut/C-DML.git, along with tutorials and code to reproduce the experiments.
Open Datasets | Yes | Semi-synthetic numerical experiments are based on financial data borrowed from Chernozhukov & Hansen (2004), a data set also used by Chernozhukov et al. (2018); see Section 4 for details. The paper additionally uses the Beijing air quality data set of Zhang et al. (2017), the Facebook blog feedback data set of Buza (2014), and the CCDDHNR2018 data set of Bach (2021), which is integrated into the DoubleML Python package of Bach et al. (2022).
Dataset Splits | Yes | The observations are randomly divided into two disjoint subsets, I1 and I2. To overcome the bias induced by overfitting, Chernozhukov et al. (2018) suggested cross-fitting, achieved by further splitting I2 into two disjoint subsets, I2,1 and I2,2. In each experiment, the data are divided into three disjoint subsets: I1 (50% of the observations), I2,1 (25%), and I2,2 (25%); a split sketch is given below the table.
Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU/CPU models, memory).
Software Dependencies | No | Random forest regression models are implemented using the Python package sklearn. The paper mentions deep neural networks but does not specify the framework (e.g., TensorFlow, PyTorch) or the versions used.
Experiment Setup | Yes | The learning rate is fixed to 0.01, clipping gradients with norms larger than 3. Early stopping is used to avoid overfitting; the number of epochs (capped at 2000) is tuned by evaluating the loss function on a hold-out data set. Random forest regression models are implemented using the Python package sklearn with default hyper-parameters, except the number of trees in the forest and the maximal depth, both of which are set to 20; a hedged configuration sketch follows the table.
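To make the Pseudocode row concrete, here is a minimal sketch of the basic DML step on one fold: the standard residual-on-residual construction in the spirit of Chernozhukov et al. (2018), not necessarily the paper's exact Algorithm 1. The function name `dml_theta` is hypothetical; the random forest settings (20 trees, depth 20) follow the Experiment Setup row.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def dml_theta(X, D, Y, I1, I2):
    """Residual-on-residual DML estimate of the treatment effect theta.

    Illustrative sketch only: nuisance models are fit on I1 (sample
    splitting) and residuals are formed on the held-out fold I2.
    """
    # Fit nuisance regressions E[Y|X] and E[D|X] on I1 only.
    m_hat = RandomForestRegressor(n_estimators=20, max_depth=20).fit(X[I1], Y[I1])
    e_hat = RandomForestRegressor(n_estimators=20, max_depth=20).fit(X[I1], D[I1])
    # Residualize on I2 and regress the outcome residual on the treatment residual.
    u = Y[I2] - m_hat.predict(X[I2])  # outcome residual
    v = D[I2] - e_hat.predict(X[I2])  # treatment residual
    return np.dot(v, u) / np.dot(v, v)
```

Cross-fitting then averages such estimates over swapped roles of the folds, which is what the further split of I2 in the Dataset Splits row supports.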
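The three-way split reported under Dataset Splits can be reproduced along these lines; the helper name `three_way_split` and the seeded permutation are our own assumptions, not the authors' code.

```python
import numpy as np

def three_way_split(n, seed=0):
    """Split n observation indices into I1 (50%), I2,1 (25%), I2,2 (25%),
    matching the proportions reported in the paper."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n1 = n // 2                    # 50% for I1
    n21 = n1 + (n - n1) // 2       # next 25% for I2,1; the rest is I2,2
    return idx[:n1], idx[n1:n21], idx[n21:]
```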
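Finally, a hedged sketch of the reported neural-network training setup: learning rate 0.01, gradient-norm clipping at 3, and early stopping with at most 2000 epochs. The paper does not name the deep learning framework, so PyTorch, the Adam optimizer, and the patience value below are assumptions.

```python
import torch

def train_with_early_stopping(model, loss_fn, train_loader, X_val, y_val,
                              max_epochs=2000, lr=0.01, clip_norm=3.0,
                              patience=50):
    """Training loop mirroring the reported setup; optimizer and patience
    are assumed, not stated in the paper."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state, stale = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for xb, yb in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            # Clip gradients whose norm exceeds 3, as reported.
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
            opt.step()
        # Early stopping: track the loss on a hold-out set.
        model.eval()
        with torch.no_grad():
            val = loss_fn(model(X_val), y_val).item()
        if val < best_val:
            best_val, stale = val, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            stale += 1
            if stale >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```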