Hyper-parameter tuning using pure ranger package in R

Note that mlr by default disables the internal parallelization of ranger. Set the hyperparameter num.threads to the number of available cores to speed up mlr:

learner <- makeLearner("classif.ranger", num.threads = 4)

Alternatively, start a parallel backend via

library(parallelMap)
parallelStartMulticore(4) # Linux/macOS
parallelStartSocket(4)    # Windows

before calling tuneParams to parallelize the tuning.
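
Putting the pieces together, a minimal sketch (assuming learner, task, rdesc, and ps are defined as in the mlr answer further down):

library(parallelMap)
parallelStartSocket(4)  # or parallelStartMulticore(4) on Linux/macOS
res <- tuneParams(learner, task, rdesc, par.set = ps,
                  control = makeTuneControlGrid())
parallelStop()          # shut the worker processes down again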


I think there are at least two errors:

First, the function ranger does not have an argument called training_data; the data argument is simply called data. That is what your error message

Error in ranger(Species ~ ., training_data = iris, num.trees = 200) : unused argument (training_data = iris)

refers to. You can see this when you look at ?ranger or args(ranger).
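
A quick way to check the argument names from the console:

library(ranger)
names(formals(ranger))  # includes "data", but no "training_data"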

Second, the function csrf, on the other hand, does take training_data as input, but it also requires test_data. Most importantly, neither of these arguments has a default, which means you must provide both. The following works without problems:

library(ranger)

fit.rf = ranger(
  Species ~ ., data = iris,
  num.trees = 200
)

fit.rf.tune = csrf(
  Species ~ .,
  training_data = iris,
  test_data = iris,
  params1 = list(num.trees = 25, mtry = 4),
  params2 = list(num.trees = 50, mtry = 4)
)

Here, I have just provided iris as both the training and the test dataset; you would obviously not want to do that in your real application. Moreover, note that ranger also takes num.trees and mtry as input, so you could try tuning them there, as in the sketch below.
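
For instance, a minimal sketch comparing the OOB error across mtry values directly in ranger (the seed argument is set only to make the runs comparable):

library(ranger)
# OOB misclassification rate for each mtry value, with num.trees fixed at 200
sapply(1:4, function(m) {
  ranger(Species ~ ., data = iris, num.trees = 200, mtry = m,
         seed = 42)$prediction.error
})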


Another way to tune the model is to build a manual grid. There may be better ways to train the model, but this is another option:

library(ranger)

hyper_grid <- expand.grid(
  mtry      = 1:4,
  node_size = 1:3,
  num.trees = seq(50, 500, 50),
  OOB_RMSE  = 0   # placeholder, filled in below
)

system.time(
  for (i in 1:nrow(hyper_grid)) {
    # train model
    rf <- ranger(
      formula       = Species ~ .,
      data          = iris,
      num.trees     = hyper_grid$num.trees[i],
      mtry          = hyper_grid$mtry[i],
      min.node.size = hyper_grid$node_size[i],
      importance    = 'impurity')
    # add OOB error to grid; for classification, prediction.error is the
    # OOB misclassification rate, so this is its square root
    hyper_grid$OOB_RMSE[i] <- sqrt(rf$prediction.error)
  })
user  system elapsed 
3.17    0.19    1.36

nrow(hyper_grid) # 120 models
position <- which.min(hyper_grid$OOB_RMSE)
head(hyper_grid[order(hyper_grid$OOB_RMSE), ], 5)
     mtry node_size num.trees     OOB_RMSE
6     2         2        50 0.1825741858
23    3         3       100 0.1825741858
3     3         1        50 0.2000000000
11    3         3        50 0.2000000000
14    2         1       100 0.2000000000
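
Note that which.min() returns only the first minimum; the output above shows ties, so you can list all tied settings with:

hyper_grid[hyper_grid$OOB_RMSE == min(hyper_grid$OOB_RMSE), ]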

# fit best model
rf.model <- ranger(
  Species ~ ., data = iris,
  num.trees     = hyper_grid$num.trees[position],
  mtry          = hyper_grid$mtry[position],
  min.node.size = hyper_grid$node_size[position],
  importance    = 'impurity',
  probability   = FALSE)
rf.model
Ranger result

Call:
 ranger(Species ~ ., data = iris, num.trees = hyper_grid$num.trees[position], importance = "impurity", probability = FALSE, min.node.size = hyper_grid$node_size[position], mtry = hyper_grid$mtry[position]) 

    Type:                             Classification 
Number of trees:                  50 
Sample size:                      150 
Number of independent variables:  4 
Mtry:                             2 
Target node size:                 2 
Variable importance mode:         impurity 
Splitrule:                        gini 
OOB prediction error:             5.33 % 
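
The fitted model can then be used for prediction: predict() on a ranger object returns a list whose predictions element holds the predicted classes.

pred <- predict(rf.model, data = iris)
head(pred$predictions)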

I hope this helps.


To answer my (unclear) question: apparently ranger has no built-in CV/grid-search functionality. However, here is how you can do hyper-parameter tuning with ranger (via a grid search) outside of caret. Thanks go to Marvin Wright (the maintainer of ranger) for the code. It turns out that caret CV with ranger was slow for me because I was using the formula interface (which should be avoided; see the sketch at the end of this answer).

ptm <- proc.time()
library(ranger)
library(mlr)

# Define task and learner
task <- makeClassifTask(id = "iris",
                        data = iris,
                        target = "Species")

learner <- makeLearner("classif.ranger")

# Choose resampling strategy and define grid
rdesc <- makeResampleDesc("CV", iters = 5)
ps <- makeParamSet(makeIntegerParam("mtry", lower = 3, upper = 4),
                   makeDiscreteParam("num.trees", values = 200))

# Tune
res = tuneParams(learner, task, rdesc, par.set = ps,
           control = makeTuneControlGrid())

# Train on entire dataset (using best hyperparameters)
lrn = setHyperPars(makeLearner("classif.ranger"), par.vals = res$x)
m = train(lrn, task)

print(m)
print(proc.time() - ptm) # ~6 seconds

For the curious, the caret equivalent is

ptm <- proc.time()
library(caret)
data(iris)

# note: newer caret versions also require splitrule and min.node.size in the grid
grid <- expand.grid(mtry = c(3, 4),
                    splitrule = "gini",
                    min.node.size = 1)

fitControl <- trainControl(method = "cv",
                           number = 5,
                           verboseIter = TRUE)

fit = train(
  x = iris[ , names(iris) != 'Species'],
  y = iris[ , names(iris) == 'Species'],
  method = 'ranger',
  num.trees = 200,
  tuneGrid = grid,
  trControl = fitControl
)
print(fit)
print(proc.time() - ptm) # ~2.4 seconds

Overall, caret is the fastest way to do a grid search with ranger if one uses the non-formula interface.
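
For completeness, here is a minimal sketch of the two non-formula ways to call plain ranger (the x/y interface is available in recent ranger versions):

library(ranger)

# non-formula interface via the dependent variable's name
fit <- ranger(dependent.variable.name = "Species", data = iris, num.trees = 200)

# or via x/y
fit <- ranger(x = iris[, -5], y = iris$Species, num.trees = 200)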