

{"id":146,"date":"2018-08-11T08:37:52","date_gmt":"2018-08-11T06:37:52","guid":{"rendered":"http:\/\/blog.hwr-berlin.de\/codeandstats\/?p=146"},"modified":"2020-09-08T07:28:32","modified_gmt":"2020-09-08T05:28:32","slug":"variable-importance-in-random-forests","status":"publish","type":"post","link":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/","title":{"rendered":"Variable Importance in Random Forests"},"content":{"rendered":"<div class=\"container-fluid main-container\">\n<p><!-- code folding --><\/p>\n<div id=\"header\" class=\"fluid-row\">\n<h1 class=\"title toc-ignore\">Variable Importance in Random Forests can suffer from severe overfitting<\/h1>\n<\/div>\n<div id=\"predictive-vs.interpretational-overfitting\" class=\"section level2\">\n<h2>Predictive vs.\u00a0interpretational overfitting<\/h2>\n<p>There appears to be broad consenus that random forests rarely suffer from \u201coverfitting\u201d which plagues many other models. (We define <em>overfitting<\/em> as choosing a model flexibility which is too high for the data generating process at hand resulting in non-optimal performance on an independent test set.) 
By averaging hundreds of separately grown deep trees &#8211; each of which inevitably overfits the data &#8211; one often achieves a favorable balance in the bias-variance tradeoff.<br \/>\nFor similar reasons, careful parameter tuning also seems less essential than in other models.<\/p>\n<p>This post does not attempt to contribute to this long-standing discussion (see e.g.\u00a0<a class=\"uri\" href=\"https:\/\/stats.stackexchange.com\/questions\/66543\/random-forest-is-overfitting\">https:\/\/stats.stackexchange.com\/questions\/66543\/random-forest-is-overfitting<\/a>) but points out that random forests\u2019 immunity to overfitting is restricted to the predictions only and does not extend to the default variable importance measure!<\/p>\n<p>We assume the reader is familiar with the basic construction of random forests, which are averages of large numbers of individually grown regression\/classification trees. The random nature stems from both &#8220;row and column subsampling&#8221;: each tree is based on a random subset of the observations, and each split is based on a random subset of candidate variables. The tuning parameter <em>mtry<\/em> (the number of candidate split variables) \u2013 which for popular software implementations has the default <span class=\"math inline\">\\(\\lfloor p\/3 \\rfloor\\)<\/span> for regression and <span class=\"math inline\">\\(\\sqrt{p}\\)<\/span> for classification trees \u2013 can have profound effects on prediction quality as well as on the variable importance measures outlined below.<\/p>\n<p>At the heart of the random forest library is the CART algorithm, which chooses the split for each node such that the maximum reduction in overall node impurity is achieved. 
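<\/p>\n<p>As a toy illustration (not code from this post; the helper names below are made up), the impurity reduction that CART maximizes can be computed in a few lines of R, using the p(1-p) impurity convention:<\/p>\n<pre class=\"r\"><code># Gini impurity of a binary node (labels 0\/1), p(1-p) convention\ngini = function(y) { p = mean(y); p * (1 - p) }\n# impurity decrease for a candidate split; 'left' is a logical indicator\ngini_decrease = function(y, left) {\n  n = length(y); nl = sum(left); nr = n - nl\n  wl = if (nl &gt; 0) (nl \/ n) * gini(y[left]) else 0\n  wr = if (nr &gt; 0) (nr \/ n) * gini(y[!left]) else 0\n  gini(y) - wl - wr\n}\ny = c(0, 0, 0, 1, 1, 1)\n# a perfect split removes all impurity: decrease = 0.25\ngini_decrease(y, left = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))<\/code><\/pre>\n<p>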
Due to the bootstrap row sampling, <span class=\"math inline\">\\(36.8\\%\\)<\/span> of the observations are (on average) not used for an individual tree; those \u201cout of bag\u201d (OOB) samples can serve as a validation set to estimate the test error, e.g.:<br \/>\n<span class=\"math display\">\\[\\begin{equation}<br \/>\nE\\left( Y - \\hat{Y}\\right)^2 \\approx OOB_{MSE} = \\frac{1}{n} \\sum_{i=1}^n{\\left( y_i - \\overline{\\hat{y}}_{i, OOB}\\right)^2}<br \/>\n\\end{equation}\\]<\/span><\/p>\n<p>where <span class=\"math inline\">\\(\\overline{\\hat{y}}_{i, OOB}\\)<\/span> is the average prediction for the <span class=\"math inline\">\\(i\\)<\/span>th observation from those trees for which this observation was OOB.<\/p>\n<div id=\"variable-importance\" class=\"section level3\">\n<h3>Variable Importance<\/h3>\n<p>The default method to compute variable importance is the <em>mean decrease in impurity<\/em> (or <em>Gini importance<\/em>) mechanism: at each split in each tree, the improvement in the split criterion is the importance measure attributed to the splitting variable, and it is accumulated over all the trees in the forest separately for each variable. 
Note that this measure is quite similar to the <span class=\"math inline\">\\(R^2\\)<\/span> of a regression on the training set.<\/p>\n<p>The widely used alternative measure of variable importance, <strong>permutation importance<\/strong> for short, is defined as follows:<br \/>\n<span class=\"math display\">\\[\\begin{equation}<br \/>\n\\label{eq:VI}<br \/>\n\\mbox{VI} = OOB_{MSE, perm} - OOB_{MSE}<br \/>\n\\end{equation}\\]<\/span><\/p>\n<p>where <span class=\"math inline\">\\(OOB_{MSE, perm}\\)<\/span> is the OOB error obtained after randomly permuting the values of the variable in question.<\/p>\n<p>(more pointers in <a class=\"uri\" href=\"https:\/\/stackoverflow.com\/questions\/15810339\/how-are-feature-importances-in-randomforestclassifier-determined\">https:\/\/stackoverflow.com\/questions\/15810339\/how-are-feature-importances-in-randomforestclassifier-determined<\/a>)<\/p>\n<\/div>\n<div id=\"gini-importance-can-be-highly-misleading\" class=\"section level3\">\n<h3>Gini importance can be highly misleading<\/h3>\n<p>We use the well-known Titanic data set to illustrate the perils of putting too much faith in the Gini importance, which is based entirely on training data &#8211; not on OOB samples &#8211; and makes no attempt to discount impurity decreases in deep trees that are pretty much frivolous and will not survive on a validation set.<\/p>\n<p>In the following model we include <em>PassengerId<\/em> as a feature along with the more reasonable Age, Sex and Pclass: <code>randomForest(Survived ~ Age + Sex + Pclass + PassengerId, data = titanic_train[!naRows,], ntree = 200, importance = TRUE, mtry = 2)<\/code><\/p>\n<p>The figure below shows both measures of variable importance, and surprisingly <em>PassengerId<\/em> turns out to be ranked number 2 for the Gini importance (right panel). This unexpected result is robust to random shuffling of the ID.<\/p>\n<p>The permutation based importance (left panel) is not fooled by the irrelevant ID feature. 
This is perhaps not unexpected, as the IDs should bear no predictive power for the out-of-bag samples.<\/p>\n<p><a href=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-2-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-171\" src=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-2-1.png\" alt=\"\" width=\"1536\" height=\"960\" \/><\/a><\/p>\n<\/div>\n<div id=\"noise-feature\" class=\"section level3\">\n<h3>Noise Feature<\/h3>\n<p>Let us go one step further and add a Gaussian noise feature, which we call PassengerWeight:<\/p>\n<pre class=\"r\"><code>titanic_train$PassengerWeight = rnorm(nrow(titanic_train), 70, 20)\nrf4 = randomForest(Survived ~ Age + Sex + Pclass + PassengerId + PassengerWeight, data = titanic_train[!naRows,], ntree = 200, importance = TRUE, mtry = 2)<\/code><\/pre>\n<p><a href=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-4-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-171\" src=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-4-1.png\" alt=\"\" width=\"1536\" height=\"960\" srcset=\"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-4-1.png 1536w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-4-1-300x188.png 300w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-4-1-768x480.png 768w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-4-1-1024x640.png 1024w\" sizes=\"auto, (max-width: 1536px) 100vw, 1536px\" \/><\/a><\/p>\n<p>Again, the blatant \u201coverfitting\u201d of the Gini variable importance is troubling, whereas the permutation-based importance (left panel) is not fooled by the irrelevant features. 
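<\/p>\n<p>The permutation importance can also be emulated by hand on simulated toy data (a sketch, not the post\u2019s code; it assumes the <code>randomForest<\/code> package is installed):<\/p>\n<pre class=\"r\"><code>library(randomForest)\nset.seed(1)\nn = 500\ndat = data.frame(x1 = rnorm(n), id = 1:n)  # x1 informative, id pure noise\ndat$y = dat$x1 + rnorm(n)\nrf = randomForest(y ~ x1 + id, data = dat, ntree = 100, importance = TRUE)\n# type = 1 returns the permutation importance (%IncMSE);\n# the noise column should score near zero\nimportance(rf, type = 1)<\/code><\/pre>\n<p>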
(Encouragingly, the importance measures for ID and weight are even negative!)<\/p>\n<p>In the remainder we investigate whether other libraries suffer from similar spurious variable importance measures.<\/p>\n<\/div>\n<\/div>\n<div id=\"h2o-library\" class=\"section level2\">\n<h2>h2o library<\/h2>\n<p>Unfortunately, the h2o random forest implementation does not offer permutation importance:<\/p>\n<p><a class=\"uri\" href=\"https:\/\/stackoverflow.com\/questions\/51584970\/permutation-importance-in-h2o-random-forest\/51598742#51598742\">https:\/\/stackoverflow.com\/questions\/51584970\/permutation-importance-in-h2o-random-forest\/51598742#51598742<\/a><\/p>\n<p>Coding passenger ID as integer is bad enough:<\/p>\n<p><a href=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-7-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-163\" src=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-7-1.png\" alt=\"\" width=\"1536\" height=\"960\" srcset=\"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-7-1.png 1536w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-7-1-300x188.png 300w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-7-1-768x480.png 768w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-7-1-1024x640.png 1024w\" sizes=\"auto, (max-width: 1536px) 100vw, 1536px\" \/><\/a><\/p>\n<p>Coding passenger ID as factor makes matters worse:<\/p>\n<p>Let\u2019s look at a single tree from the forest:<\/p>\n<p><a href=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-9-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-164\" src=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-9-1.png\" alt=\"\" 
width=\"1536\" height=\"960\" srcset=\"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-9-1.png 1536w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-9-1-300x188.png 300w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-9-1-768x480.png 768w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-9-1-1024x640.png 1024w\" sizes=\"auto, (max-width: 1536px) 100vw, 1536px\" \/><\/a><\/p>\n<p>If we scramble ID, does it hold up?<\/p>\n<p><a href=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-11-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-165\" src=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-11-1.png\" alt=\"\" width=\"1536\" height=\"960\" srcset=\"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-11-1.png 1536w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-11-1-300x188.png 300w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-11-1-768x480.png 768w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-11-1-1024x640.png 1024w\" sizes=\"auto, (max-width: 1536px) 100vw, 1536px\" \/><\/a><\/p>\n<\/div>\n<div id=\"partykit\" class=\"section level2\">\n<h2>partykit<\/h2>\n<p>Conditional inference trees are not fooled by ID:<\/p>\n<p><a href=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-13-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-166\" src=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-13-1.png\" alt=\"\" width=\"1536\" height=\"960\" 
srcset=\"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-13-1.png 1536w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-13-1-300x188.png 300w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-13-1-768x480.png 768w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-13-1-1024x640.png 1024w\" sizes=\"auto, (max-width: 1536px) 100vw, 1536px\" \/><\/a><\/p>\n<p>And the variable importance in <em>cforest<\/em> is indeed unbiased.<\/p>\n<p><a href=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-15-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-167\" src=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-15-1.png\" alt=\"\" width=\"1536\" height=\"960\" srcset=\"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-15-1.png 1536w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-15-1-300x188.png 300w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-15-1-768x480.png 768w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-15-1-1024x640.png 1024w\" sizes=\"auto, (max-width: 1536px) 100vw, 1536px\" \/><\/a><\/p>\n<\/div>\n<div id=\"pythons-sklearn\" class=\"section level2\">\n<h2>python\u2019s sklearn<\/h2>\n<p>Unfortunately, like h2o, the python random forest implementation offers only Gini importance, but this insightful post offers a solution:<\/p>\n<p><a class=\"uri\" href=\"http:\/\/explained.ai\/rf-importance\/index.html\">http:\/\/explained.ai\/rf-importance\/index.html<\/a><\/p>\n<\/div>\n<div id=\"gradient-boosting\" class=\"section level2\">\n<h2>Gradient Boosting<\/h2>\n<p>Boosting is highly robust against frivolous columns:<\/p>\n<pre 
class=\"r\"><code>mdlGBM = gbm(Survived ~ Age + Sex + Pclass + PassengerId + PassengerWeight, data = titanic_train, n.trees = 300, shrinkage = 0.01, distribution = \"gaussian\")<\/code><\/pre>\n<p><a href=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/boostingInfluence.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-174\" src=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/boostingInfluence.png\" alt=\"\" width=\"1536\" height=\"960\" srcset=\"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/boostingInfluence.png 1536w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/boostingInfluence-300x188.png 300w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/boostingInfluence-768x480.png 768w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/boostingInfluence-1024x640.png 1024w\" sizes=\"auto, (max-width: 1536px) 100vw, 1536px\" \/><\/a><\/p>\n<\/div>\n<div id=\"conclusion\" class=\"section level2\">\n<h2>Conclusion<\/h2>\n<p>Sadly, this post is 12 years behind the literature:<\/p>\n<p>It has been known for a while now that the Gini importance tends to inflate the importance of continuous or high-cardinality categorical variables:<\/p>\n<blockquote><p>the variable importance measures of Breiman\u2019s original Random Forest method \u2026 are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories.<\/p><\/blockquote>\n<p>(Strobl et al., 2007, <a href=\"https:\/\/link.springer.com\/article\/10.1186%2F1471-2105-8-25\">Bias in random forest variable importance measures: Illustrations, sources and a solution<\/a>)<\/p>\n<\/div>\n<div id=\"single-trees\" class=\"section level2\">\n<h2>Single Trees<\/h2>\n<p>I am still struggling with the extent of the overfitting. 
It is hard to believe that passenger ID could be chosen as a split point <strong>early<\/strong> in the tree-building process given the other informative variables! Let us inspect a single tree:<\/p>\n<p><a href=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-19-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-168\" src=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-19-1.png\" alt=\"\" width=\"1536\" height=\"960\" srcset=\"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-19-1.png 1536w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-19-1-300x188.png 300w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-19-1-768x480.png 768w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-19-1-1024x640.png 1024w\" sizes=\"auto, (max-width: 1536px) 100vw, 1536px\" \/><\/a><\/p>\n<pre><code>##   rowname left daughter right daughter   split var split point status\n## 1       1             2              3      Pclass         2.5      1\n## 2       2             4              5      Pclass         1.5      1\n## 3       3             6              7 PassengerId        10.0      1\n## 4       4             8              9         Sex         1.5      1\n## 5       5            10             11         Sex         1.5      1\n## 6       6            12             13 PassengerId         2.5      1\n##   prediction\n## 1       &lt;NA&gt;\n## 2       &lt;NA&gt;\n## 3       &lt;NA&gt;\n## 4       &lt;NA&gt;\n## 5       &lt;NA&gt;\n## 6       &lt;NA&gt;<\/code><\/pre>\n<p>This tree splits on passenger ID at the second level! 
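<\/p>\n<p>Such a tree table can be extracted from any fitted forest with <code>getTree<\/code>; a self-contained sketch on the built-in iris data (not the post\u2019s model):<\/p>\n<pre class=\"r\"><code>library(randomForest)\nset.seed(1)\nrf = randomForest(Species ~ ., data = iris, ntree = 50)\n# k selects the tree; labelVar = TRUE shows variable names instead of indices\nhead(getTree(rf, k = 1, labelVar = TRUE))<\/code><\/pre>\n<p>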
Let us dig deeper:<\/p>\n<p>The help page states:<\/p>\n<blockquote><p>For numerical predictors, data with values of the variable less than or equal to the splitting point go to the left daughter node.<\/p><\/blockquote>\n<p>So we have the 3rd class passengers on the right branch. Compare subsequent splits on (i) sex, (ii) Pclass and (iii) PassengerId:<\/p>\n<p>Starting with a parent node Gini impurity of 0.184:<\/p>\n<p>Splitting on sex yields a Gini impurity of 0.159 (columns: Sex, coded 1 = female, 2 = male; rows: Survived).<\/p>\n<table class=\"table table-striped\" style=\"width: auto !important; margin-left: auto; margin-right: auto;\">\n<thead>\n<tr>\n<th style=\"text-align: left;\"><\/th>\n<th style=\"text-align: right;\">1<\/th>\n<th style=\"text-align: right;\">2<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align: left;\">0<\/td>\n<td style=\"text-align: right;\">72<\/td>\n<td style=\"text-align: right;\">303<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">1<\/td>\n<td style=\"text-align: right;\">71<\/td>\n<td style=\"text-align: right;\">50<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Splitting on PassengerId yields a Gini impurity of 0.183.<\/p>\n<table class=\"table table-striped\" style=\"width: auto !important; margin-left: auto; margin-right: auto;\">\n<thead>\n<tr>\n<th style=\"text-align: left;\"><\/th>\n<th style=\"text-align: right;\">FALSE<\/th>\n<th style=\"text-align: right;\">TRUE<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align: left;\">0<\/td>\n<td style=\"text-align: right;\">2<\/td>\n<td style=\"text-align: right;\">373<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">1<\/td>\n<td style=\"text-align: right;\">3<\/td>\n<td style=\"text-align: right;\">118<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And how could passenger ID accrue more importance than sex?<\/p>\n<p><a href=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-26-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-169\" 
src=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-26-1.png\" alt=\"\" width=\"1536\" height=\"960\" srcset=\"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-26-1.png 1536w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-26-1-300x188.png 300w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-26-1-768x480.png 768w, https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-26-1-1024x640.png 1024w\" sizes=\"auto, (max-width: 1536px) 100vw, 1536px\" \/><\/a><\/p>\n<hr \/>\n<\/div>\n<\/div>\n<p><script><\/p>\n<p>\/\/ add bootstrap table styles to pandoc tables<br \/>\nfunction bootstrapStylePandocTables() {<br \/>\n  $('tr.header').parent('thead').parent('table').addClass('table table-condensed');<br \/>\n}<br \/>\n$(document).ready(function () {<br \/>\n  bootstrapStylePandocTables();<br \/>\n});<\/p>\n<p><\/script><\/p>\n<p><!-- dynamically load mathjax for compatibility with self-contained --><br \/>\n<script><br \/>\n  (function () {<br \/>\n    var script = document.createElement(\"script\");<br \/>\n    script.type = \"text\/javascript\";<br \/>\n    script.src  = \"https:\/\/mathjax.rstudio.com\/latest\/MathJax.js?config=TeX-AMS-MML_HTMLorMML\";<br \/>\n    document.getElementsByTagName(\"head\")[0].appendChild(script);<br \/>\n  })();<br \/>\n<\/script><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Variable Importance in Random Forests can suffer from severe overfitting Predictive vs.\u00a0interpretational overfitting There appears to be broad consenus that random forests rarely suffer from \u201coverfitting\u201d which plagues many other models. 
(We define overfitting as choosing a model flexibility which is too high for the data generating process at hand resulting in non-optimal performance on &hellip; <a href=\"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Variable Importance in Random Forests<\/span><\/a><\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-146","post","type-post","status-publish","format-standard","hentry","category-r"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Variable Importance in Random Forests - Code and Stats<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Variable Importance in Random Forests - Code and Stats\" \/>\n<meta property=\"og:description\" content=\"Variable Importance in Random Forests can suffer from severe overfitting Predictive vs.\u00a0interpretational overfitting There appears to be broad consenus that random forests rarely suffer from \u201coverfitting\u201d which plagues many other models. 
(We define overfitting as choosing a model flexibility which is too high for the data generating process at hand resulting in non-optimal performance on &hellip; Continue reading Variable Importance in Random Forests\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/\" \/>\n<meta property=\"og:site_name\" content=\"Code and Stats\" \/>\n<meta property=\"article:published_time\" content=\"2018-08-11T06:37:52+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-09-08T05:28:32+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-2-1.png\" \/>\n<meta name=\"author\" content=\"Markus L\u00f6cher\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Markus L\u00f6cher\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/variable-importance-in-random-forests\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/variable-importance-in-random-forests\\\/\"},\"author\":{\"name\":\"Markus L\u00f6cher\",\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/#\\\/schema\\\/person\\\/b8b34fa3bb407386693d915a45ba08be\"},\"headline\":\"Variable Importance in Random Forests\",\"datePublished\":\"2018-08-11T06:37:52+00:00\",\"dateModified\":\"2020-09-08T05:28:32+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/variable-importance-in-random-forests\\\/\"},\"wordCount\":1027,\"image\":{\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/variable-importance-in-random-forests\\\/#primaryimage\"},\"thumbnailUrl\":\"http:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/wp-content\\\/uploads\\\/2018\\\/08\\\/unnamed-chunk-2-1.png\",\"articleSection\":[\"R\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/variable-importance-in-random-forests\\\/\",\"url\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/variable-importance-in-random-forests\\\/\",\"name\":\"Variable Importance in Random Forests - Code and 
Stats\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/variable-importance-in-random-forests\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/variable-importance-in-random-forests\\\/#primaryimage\"},\"thumbnailUrl\":\"http:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/wp-content\\\/uploads\\\/2018\\\/08\\\/unnamed-chunk-2-1.png\",\"datePublished\":\"2018-08-11T06:37:52+00:00\",\"dateModified\":\"2020-09-08T05:28:32+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/#\\\/schema\\\/person\\\/b8b34fa3bb407386693d915a45ba08be\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/variable-importance-in-random-forests\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/variable-importance-in-random-forests\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/variable-importance-in-random-forests\\\/#primaryimage\",\"url\":\"http:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/wp-content\\\/uploads\\\/2018\\\/08\\\/unnamed-chunk-2-1.png\",\"contentUrl\":\"http:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/wp-content\\\/uploads\\\/2018\\\/08\\\/unnamed-chunk-2-1.png\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/variable-importance-in-random-forests\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Variable Importance in Random 
Forests\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/#website\",\"url\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/\",\"name\":\"Code and Stats\",\"description\":\"Statistics, Probability and R\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/blog.hwr-berlin.de\\\/codeandstats\\\/#\\\/schema\\\/person\\\/b8b34fa3bb407386693d915a45ba08be\",\"name\":\"Markus L\u00f6cher\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/6ca24d61b3fa96cb1a5f305a4b918469dd9e62da1c9887160357fc3343083247?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/6ca24d61b3fa96cb1a5f305a4b918469dd9e62da1c9887160357fc3343083247?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/6ca24d61b3fa96cb1a5f305a4b918469dd9e62da1c9887160357fc3343083247?s=96&d=mm&r=g\",\"caption\":\"Markus L\u00f6cher\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Variable Importance in Random Forests - Code and Stats","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/","og_locale":"en_US","og_type":"article","og_title":"Variable Importance in Random Forests - Code and Stats","og_description":"Variable Importance in Random Forests can suffer from severe overfitting Predictive vs.\u00a0interpretational overfitting There appears to be broad consenus that random forests rarely suffer from \u201coverfitting\u201d which plagues many other models. (We define overfitting as choosing a model flexibility which is too high for the data generating process at hand resulting in non-optimal performance on &hellip; Continue reading Variable Importance in Random Forests","og_url":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/","og_site_name":"Code and Stats","article_published_time":"2018-08-11T06:37:52+00:00","article_modified_time":"2020-09-08T05:28:32+00:00","og_image":[{"url":"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-2-1.png","type":"","width":"","height":""}],"author":"Markus L\u00f6cher","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Markus L\u00f6cher","Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/#article","isPartOf":{"@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/"},"author":{"name":"Markus L\u00f6cher","@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/#\/schema\/person\/b8b34fa3bb407386693d915a45ba08be"},"headline":"Variable Importance in Random Forests","datePublished":"2018-08-11T06:37:52+00:00","dateModified":"2020-09-08T05:28:32+00:00","mainEntityOfPage":{"@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/"},"wordCount":1027,"image":{"@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/#primaryimage"},"thumbnailUrl":"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-2-1.png","articleSection":["R"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/","url":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/","name":"Variable Importance in Random Forests - Code and 
Stats","isPartOf":{"@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/#primaryimage"},"image":{"@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/#primaryimage"},"thumbnailUrl":"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-2-1.png","datePublished":"2018-08-11T06:37:52+00:00","dateModified":"2020-09-08T05:28:32+00:00","author":{"@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/#\/schema\/person\/b8b34fa3bb407386693d915a45ba08be"},"breadcrumb":{"@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/#primaryimage","url":"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-2-1.png","contentUrl":"http:\/\/blog.hwr-berlin.de\/codeandstats\/wp-content\/uploads\/2018\/08\/unnamed-chunk-2-1.png"},{"@type":"BreadcrumbList","@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/variable-importance-in-random-forests\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/blog.hwr-berlin.de\/codeandstats\/"},{"@type":"ListItem","position":2,"name":"Variable Importance in Random Forests"}]},{"@type":"WebSite","@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/#website","url":"https:\/\/blog.hwr-berlin.de\/codeandstats\/","name":"Code and Stats","description":"Statistics, Probability and 
R","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.hwr-berlin.de\/codeandstats\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/blog.hwr-berlin.de\/codeandstats\/#\/schema\/person\/b8b34fa3bb407386693d915a45ba08be","name":"Markus L\u00f6cher","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/6ca24d61b3fa96cb1a5f305a4b918469dd9e62da1c9887160357fc3343083247?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/6ca24d61b3fa96cb1a5f305a4b918469dd9e62da1c9887160357fc3343083247?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/6ca24d61b3fa96cb1a5f305a4b918469dd9e62da1c9887160357fc3343083247?s=96&d=mm&r=g","caption":"Markus L\u00f6cher"}}]}},"_links":{"self":[{"href":"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-json\/wp\/v2\/posts\/146","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-json\/wp\/v2\/comments?post=146"}],"version-history":[{"count":8,"href":"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-json\/wp\/v2\/posts\/146\/revisions"}],"predecessor-version":[{"id":186,"href":"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-json\/wp\/v2\/posts\/146\/revisions\/186"}],"wp:attachment":[{"href":"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-json\/wp\/v2\/media?parent=146"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-json\/wp\/v2\/categories?post=146"},{"taxonomy":"post_tag","emb
eddable":true,"href":"https:\/\/blog.hwr-berlin.de\/codeandstats\/wp-json\/wp\/v2\/tags?post=146"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}