Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression
Authors: Domagoj Ćevid, Loris Michel, Jeffrey Näf, Peter Bühlmann, Nicolai Meinshausen
JMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we show on a broad range of examples in Section 4 how many different statistical estimation problems, some of which are not easily tractable by existing forest-based methods, can be cast to our framework, thus illustrating the usefulness and versatility of DRF. 4. Applications and Numerical Experiments: The goal of this section is to demonstrate the versatility and applicability of DRF for many practical problems. We show that DRF can be used not only as an estimator of the multivariate conditional distribution, but also as a two-step method to easily obtain out-of-the-box estimators for various, and potentially complex, targets τ(x). Our main focus lies on the more complicated targets which cannot be straightforwardly approached by conventional methods. However, we also illustrate the usage of DRF for certain applications for which there already exist several well-established methods. Whenever possible in such cases, we compare the performance of DRF with the specialized, task-specific methods to show that, despite its generality, there is at most a very small loss of precision. However, we should point out that for many targets that cannot be written in the form of a conditional mean or a conditional quantile, for example conditional correlation, direct comparison of accuracy is not possible on real data, since no suitable loss function exists and the ground truth is unknown. |
| Researcher Affiliation | Academia | Domagoj Ćevid, Loris Michel, Jeffrey Näf, Peter Bühlmann, Nicolai Meinshausen; Seminar für Statistik, ETH Zürich, 8092 Zürich, Switzerland |
| Pseudocode | Yes | Appendix A. Implementation Details: Here we present the implementation of the Distributional Random Forests (DRF) in detail. The code is available in the R-package drf and the Python package drf. The implementation is based on the implementations of the R-packages grf (Athey et al., 2019) and ranger (Wright and Ziegler, 2017). The largest difference is in the splitting criterion itself and the provided user interface. Algorithm 1 gives the pseudocode for the forest construction and computation of the weighting function w_x(·). |
| Open Source Code | Yes | The code is available as Python and R packages drf. |
| Open Datasets | Yes | Five years (2015–2019) of air pollution measurements were obtained from the US Environmental Protection Agency (EPA) website. Six main air pollutants (nitrogen dioxide (NO2), carbon monoxide (CO), sulphur dioxide (SO2), ozone (O3), and coarse and fine particulate matter (PM10 and PM2.5)) that form the air quality index (AQI) were measured at many different measuring sites in the US for which we know the longitude, latitude, elevation, location setting (rural, urban, suburban) and how the land is used within a 1/4 mile radius. We use the benchmark data sets from the multi-target regression literature (Tsoumakas et al., 2011) together with some additional ones created from the data sets described throughout this paper. This data set is obtained from the CDC Vital Statistics Data Online Portal (https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm) and contains the information about the 3.8 million births in 2018. The PUMS (Public Use Microdata Area) data from the 2018 1-Year American Community Survey is obtained from the US Census Bureau API (https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf). |
| Dataset Splits | Yes | We run all methods and compute the root mean squared error of the obtained CATE estimate on a randomly generated test set Xtest containing 1000 data points. CATE corresponds to the coefficient of W in the data generating mechanism of Y. We repeat the same procedure 100 times and report the average result. We additionally include the estimates obtained by the k-nearest neighbor algorithm for several different values of k. Furthermore, Table 3 shows non-inferiority of DRF compared to the standard Random Forest for the classical task of estimating the conditional mean. We observe that DRF has a good relative performance that makes it on par with existing algorithms, some of which are specially designed for the problem of estimating conditional quantiles. The performance of each method is evaluated as follows: We consider the quantile (pinball) loss for the resulting quantile estimates provided by each candidate method for the different percentiles α ∈ {0.1, 0.3, 0.5, 0.7, 0.9}. The losses are presented and computed based on repeated (10 times) out-of-sample validation (with a 70%/30% ratio between the training and testing set sizes). We use 100,000 randomly chosen data points for training the DRF. |
| Hardware Specification | No | No specific hardware details (like CPU/GPU models, memory, or specific cloud instances) were mentioned in the paper, neither in the main text nor in the appendices, for running the experiments. |
| Software Dependencies | Yes | The code is available in the R-package drf and the Python package drf. The implementation is based on the implementations of the R-packages grf (Athey et al., 2019) and ranger (Wright and Ziegler, 2017). We estimate the mean with smoothing splines with a small manually chosen number of degrees of freedom. We fit the 0.1 and 0.9 quantiles as the best linear functions that minimize the sum of quantile losses, by using the quantreg package (Koenker et al., 2012). |
| Experiment Setup | Yes | Every tree is constructed based on a random subset of size s (taken to be 50% of the size of the training set by default) of the training data set, similar to Wager and Athey (2018). This differs from the original Random Forest algorithm (Breiman, 2001), where the bootstrap subsampling is done by drawing from the original sample with replacement. The principle of honesty (Biau, 2012; Denil et al., 2014; Wager and Athey, 2018) is used for building the trees (line 4), whereby for each tree one first performs the splitting based on one random set of data points Sbuild, and then populates the leaves with a disjoint random set Spopulate of data points for determining the weighting function w_x(·). This prevents overfitting, since we do not assign weight to the data points which we used to build the tree. We borrow the method for selecting the number of candidate splitting variables from the grf package (Athey et al., 2019). This number is randomly generated as min(max(Poisson(mtry), 1), p), where mtry is a tuning parameter. This differs from the original Random Forests algorithm, where the number of splitting candidates is fixed to be mtry. The number of trees built is N = 2000 by default. We try to enforce splits where each child has at least a fixed percentage (chosen to be 10% as the default value) of the current number of data points. In this way we achieve balanced splits and reduce the computational time. However, we cannot enforce this if we are trying to split on a variable Xi with only a few unique values, e.g. an indicator variable for a level of some factor variable. All components of the response Y are scaled for the building step (but not when we populate the leaves). This ensures that each component of the response contributes equally to the kernel values, and consequently to the MMD two-sample test statistic. Plain usage of the MMD two-sample test would scale the components of Y at each node. However, this approach favors always splitting on the same variables, even though their effect will diminish significantly after having split several times. By default, in step 20 of Algorithm 1, we use the MMD-based splitting criterion given by... The Gaussian kernel k(x, y) = (1 / (√(2π)σ)^d) · exp(−‖x − y‖₂² / (2σ²)) is used as the default choice, with the bandwidth σ chosen as the median pairwise distance between all training responses {y_i}, i = 1, …, n, commonly referred to as the median heuristic (Gretton et al., 2012c). However, the algorithm can be used with any choice of kernel, or in fact with any two-sample test. The number B of random Fourier features is fixed and taken to be 20 by default. |
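The Experiment Setup row above describes the default Gaussian kernel, the median-heuristic bandwidth, and the MMD two-sample statistic that drives the splits. A minimal sketch of those three pieces in NumPy (this is an illustration of the quoted formulas, not the actual drf implementation, and uses the exact V-statistic form of MMD² rather than the paper's Fourier-feature approximation):

```python
import numpy as np

def median_heuristic(Y):
    """Bandwidth sigma = median pairwise Euclidean distance between responses."""
    diffs = Y[:, None, :] - Y[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    iu = np.triu_indices(len(Y), k=1)       # distinct pairs only
    return np.median(dists[iu])

def gaussian_kernel(x, y, sigma):
    """k(x, y) = (1 / (sqrt(2*pi)*sigma)^d) * exp(-||x - y||^2 / (2*sigma^2))."""
    d = len(x)
    sq = np.sum((x - y) ** 2)
    return (2 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq / (2 * sigma ** 2))

def mmd2(YL, YR, sigma):
    """Biased (V-statistic) squared MMD between two child samples."""
    def mean_k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        d = A.shape[1]
        return np.mean((2 * np.pi * sigma ** 2) ** (-d / 2)
                       * np.exp(-sq / (2 * sigma ** 2)))
    return mean_k(YL, YL) + mean_k(YR, YR) - 2 * mean_k(YL, YR)
```

A split candidate would then be scored by how large `mmd2` is between the responses falling into the two children, favoring splits that separate the conditional distributions.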
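The Dataset Splits row evaluates quantile estimates with the quantile (pinball) loss over percentiles α ∈ {0.1, 0.3, 0.5, 0.7, 0.9}. For reference, the standard pinball loss can be written in a few lines (a generic sketch of the loss itself, not the paper's evaluation harness):

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Quantile (pinball) loss at level alpha, averaged over observations.

    Penalizes under-prediction by alpha and over-prediction by (1 - alpha),
    so it is minimized in expectation by the true alpha-quantile.
    """
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))
```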
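The setup also mentions B = 20 random Fourier features for approximating the MMD criterion. A sketch of the Rahimi–Recht random-feature map for the (unnormalized) Gaussian kernel exp(−‖x − y‖² / (2σ²)) is below; the function name and interface are illustrative, and the drf package's internal feature construction may differ:

```python
import numpy as np

def random_fourier_features(Y, sigma, B=20, seed=None):
    """Map responses to 2B-dim features whose inner products approximate
    the Gaussian kernel exp(-||x - y||^2 / (2*sigma^2))."""
    rng = np.random.default_rng(seed)
    d = Y.shape[1]
    # Frequencies drawn from the kernel's spectral density N(0, sigma^-2 I).
    W = rng.normal(scale=1.0 / sigma, size=(d, B))
    proj = Y @ W
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1) / np.sqrt(B)
```

With these features, the MMD² between two child samples reduces to a squared distance between their mean feature vectors, which is what makes a small fixed B computationally attractive inside the splitting loop.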
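Finally, the honest weighting function w_x(·) described in the Pseudocode and Experiment Setup rows averages, over trees, the uniform weights of the populated leaf containing x. A toy sketch assuming leaf assignments are already computed (the helper name and array layout are hypothetical, not the drf API):

```python
import numpy as np

def forest_weights(leaf_ids, x_leaf_ids):
    """Average nearest-neighbor weights w_x(i) over trees.

    leaf_ids:   (n_trees, n) leaf id of each training point in each tree,
                where the leaves were populated with the honest half of the data.
    x_leaf_ids: (n_trees,) leaf id of the query point x in each tree.
    """
    n_trees, n = leaf_ids.shape
    w = np.zeros(n)
    for t in range(n_trees):
        in_leaf = leaf_ids[t] == x_leaf_ids[t]
        if in_leaf.any():
            w += in_leaf / in_leaf.sum()   # uniform weight within the leaf
    return w / n_trees
```

The resulting weights sum to one (when every tree places x in a nonempty leaf) and define the estimated conditional distribution of Y given X = x as a weighted empirical distribution over the training responses.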