Calculating MAPE in H2O: Error: Provided column type POSIXct is unknown

H2O is running in a separate process to R (whether H2O is on the local server or in a distant data centre). The H2O data and the H2O models are kept in that H2O process, and cannot be seen by R.

What dH <- as.h2o(dR) does is copy an R data frame, dR, into H2O's memory space. The dH is then an R variable that describes the H2O data frame. I.e. it is a pointer, or a handle; it is not the data itself.

What dR <- as.data.frame(dH) does is copy the data from the H2O process's memory into the R process's memory. (as.vector(dH) does the same when dH describes a single column.)

So, the simplest way to modify your mape_calc(), assuming that sub_df is an R data frame, is to change the first two lines as follows:

mape_calc <- function(sub_df) {
  p <- h2o.predict(rforest.model, as.h2o(sub_df))
  pred <- as.vector(p)

  actual <- sub_df$Ptot
  mape <- 100 * mean(abs((actual - pred)/actual))

  new_df <- data.frame(date = sub_df$date[[1]], mape = mape)

  return(new_df)
}

I.e. upload sub_df to H2O, and give that to h2o.predict(). Then use as.vector() to download the prediction that was made.

This was relative to your original code. So keep the original version of this:

# LIST OF ONE-ROW DATAFRAMES
df_list <- by(test_data, test_data$date, mape_calc)

I.e. don't use by() directly on test_h2o.


UPDATE based on edited question:

I made two changes to your example code. First, I removed the date column from sub_df. That was what was causing the error message.

The second change was just to simplify the return type; not important, but you ended up with the date column duplicated, before.

mape_calc <- function(sub_df) {
  sub_df_minus_date <- subset(sub_df, select=-c(date))
  p <- h2o.predict(my_gbm, as.h2o(sub_df_minus_date))
  pred <- as.vector(p)
  actual <- sub_df$medv
  mape <- 100 * mean(abs((actual - pred)/actual))
  data.frame(mape = mape)
}
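If your data has more than one timestamp column, the same idea can be generalised. Here is a minimal sketch in plain R; drop_posixct() is a hypothetical helper I made up for illustration (it is not part of H2O), and the toy data frame just borrows the medv column name from your example:

```r
# Hypothetical helper (not an H2O function): drop every POSIXct column
# from an R data frame before handing it to as.h2o().
drop_posixct <- function(df) {
  is_time <- vapply(df, function(col) inherits(col, "POSIXct"), logical(1))
  df[, !is_time, drop = FALSE]
}

# Toy data, made up for this sketch.
toy <- data.frame(
  date  = as.POSIXct(c("2017-03-01 08:00:00", "2017-03-01 17:30:00"), tz = "UTC"),
  medv  = c(20, 25),
  lstat = c(6.1, 5.9)
)

# First class of each column; this shows which one is POSIXct.
sapply(toy, function(col) class(col)[1])

# Only the non-timestamp columns remain.
names(drop_posixct(toy))
```

The original sub_df still has its date column, so you can keep using it for grouping and for labelling the results.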

ASIDE: h2o.predict() is most efficient when given a whole batch of data at once. Putting h2o.predict() inside a loop is a code smell. You would be better to call h2o.predict(rforest.model, test_h2o) once, outside the loop, then download the predictions into R, cbind them to test_data, and then use by() on that combined data.

UPDATE Here is your example changed to work that way: (I've added the prediction as an extra column to the test data; there are other ways to do it, of course)

test_h2o <- as.h2o(subset(test_data_finialized, select=-c(date)))
p <- h2o.predict(my_gbm, test_h2o)
test_data_finialized$pred <- as.vector(p)

mape_calc2 <- function(sub_df) {
  actual <- sub_df$medv
  mape <- 100 * mean(abs((actual - sub_df$pred)/actual))
  data.frame(mape = mape)
}

df_list <- by(test_data_finialized, test_data_finialized$date, mape_calc2)

You should notice that it runs much quicker.
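If you want to convince yourself that the two patterns give the same numbers, here is a small sketch that runs without an H2O cluster. Note the assumptions: lm() on the built-in mtcars data is only a stand-in for your H2O model, predict() stands in for h2o.predict() plus as.vector(), and mape() is a helper name I chose for this example.

```r
# Stand-in model: lm() on the built-in mtcars data (NOT an H2O model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Helper for this sketch: mean absolute percentage error.
mape <- function(actual, pred) 100 * mean(abs((actual - pred) / actual))

# Pattern 1: predict inside the grouping function (one predict call per group).
per_group <- by(mtcars, mtcars$cyl, function(sub_df) {
  mape(sub_df$mpg, predict(fit, newdata = sub_df))
})

# Pattern 2: predict once for all rows, store the result as a column,
# then group on the combined data.
scored  <- cbind(mtcars, pred = predict(fit, newdata = mtcars))
batched <- by(scored, scored$cyl, function(sub_df) mape(sub_df$mpg, sub_df$pred))

# Both patterns give the same per-group MAPE.
all.equal(as.numeric(per_group), as.numeric(batched))
```

With a local lm() the speed difference is negligible; with H2O each predict call also involves copying data between the two processes, which is why batching matters there.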

ADDITIONAL UPDATE: by() works by grouping rows that share the same value of its second argument, and processing each group together. As all your timestamps are different, you are processing one row at a time.

Look into the xts library, and e.g. apply.daily() to group timestamps. But for the simple case of wanting to process by date, there is a simple hack. Change your by() line to:

df_list <- by(test_data_finialized, as.Date(test_data_finialized$date), mape_calc2)

Using as.Date() will strip off the times. Therefore all the rows on the same day now look the same and get processed together.
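Here is a toy illustration of the difference, in plain R with no H2O needed. The data values are made up, and I have set the time zone to UTC so that as.Date() is unambiguous:

```r
# Toy data: four timestamps spread over two days, with actual values (medv)
# and predictions (pred) already stored as columns.
toy <- data.frame(
  date = as.POSIXct(c("2017-03-01 08:00:00", "2017-03-01 17:30:00",
                      "2017-03-02 09:15:00", "2017-03-02 21:45:00"), tz = "UTC"),
  medv = c(20, 25, 40, 50),
  pred = c(22, 20, 36, 55)
)

mape_calc2 <- function(sub_df) {
  actual <- sub_df$medv
  mape <- 100 * mean(abs((actual - sub_df$pred) / actual))
  data.frame(mape = mape)
}

# Grouping on the raw timestamps: every row ends up in its own group.
by_timestamp <- by(toy, toy$date, mape_calc2)
length(by_timestamp)  # 4 groups, one per row

# Grouping on the calendar date: rows from the same day are processed together.
by_day <- by(toy, as.Date(toy$date), mape_calc2)
length(by_day)        # 2 groups, one per day

# Combine the one-row data frames into a single data frame.
daily <- do.call(rbind, by_day)
daily$mape            # 15 for the first day, 10 for the second
```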

ASIDE 2: You would get better responses if you make the infamous minimal example. Then people can run your code, and they can test their answers. It is also often better to use a simple data set everyone has, e.g. iris, rather than your own data. (You can do regression on any of the first 4 fields; using iris does not always have to be about predicting the species.)

ASIDE 3: You can do MAPE completely inside H2O, as the abs() and mean() functions will work directly on H2O data frames (as do lots of other things - see the H2O manual): https://stackoverflow.com/a/43103229/841830 (I'm not marking this as a duplicate, as your question was how to adapt by() for use with H2O data frames, not how to calculate MAPE efficiently!)


It looks like you are mixing up R and H2O data types. Remember that H2O's R package is simply an R API to H2O, and its objects are not native R objects. This means that you can't apply an R function that expects an R data frame to an H2OFrame. Likewise, you can't apply an H2O function that expects an H2OFrame to an R data frame.

As you can see from the R docs on by(), it is a function that expects "an R object, normally a data frame, possibly a matrix", so you can't pass in an H2OFrame.

Similarly you are passing date = H2OFrame to data.frame().

However, you can use as.data.frame() to convert an H2OFrame to an R data frame and then do your calculations entirely in R.