cost function in cv.glm of boot library in R

It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:

 # [...] Since the response is a binary variable an
 # appropriate cost function is
 cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

This does essentially what you were trying to do with your example. Replacing your "no" and "yes" with 0 and 1, let's say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the mean misclassification rate:

## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <-  rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1

## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333  ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
                 ## (assuming you use 0.5 as the cutoff for your predictions).

I'm guessing the trickiest bit of this will be just getting your mind fully wrapped around the idea of passing a function in as an argument. (At least that was, for the longest time, the hardest part of using the boot package for me, as it requires that move in a fair number of places.)
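For concreteness, here is a minimal sketch of that move, handing cost() to cv.glm() as its third argument. The data set and model below are made up for illustration, not taken from your question:

library(boot)

## Made-up data: a binary response driven by one predictor
set.seed(2)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, size = 1, prob = plogis(d$x))

fit <- glm(y ~ x, family = binomial, data = d)

## Note that cost is passed in as a function, not called
cv.err <- cv.glm(d, fit, cost, K = 10)
cv.err$delta[1]  ## cross-validated misclassification rate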


Added on 2016-03-22:

The function cost() given above is, in my opinion, unnecessarily obfuscated; the following alternative does exactly the same thing for a 0/1 response, but in a more expressive way:

cost <- function(r, pi = 0) {
  ## Misclassified: predicted "no" (pi < 0.5) when r is 1,
  ## or predicted "yes" (pi > 0.5) when r is 0
  mean((pi < 0.5 & r == 1) | (pi > 0.5 & r == 0))
}
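A quick check on the simulated vectors from above shows it agrees with the original version:

cost(response, predict)
# [1] 0.3333333  ## same 3/9 misclassification rate as before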

I will try to explain the cost function in simple words. Let's take the arguments of cv.glm(data, glmfit, cost, K) step by step:

  1. data The data set, consisting of many observations; think of it as the rows of a data frame.

  2. glmfit The generalized linear model fitted to the data (the output of glm()). Note that it is cv.glm() itself, not glmfit, that splits the data into K roughly equal parts; for each part it refits the model on the other K-1 parts (the training set) and predicts the held-out part (the test set). The predictions form a vector with the same number of elements as the held-out part.

  3. cost The cost function. It takes two arguments: first the observed responses in the test set, and second the corresponding predictions. The default is the average squared error function (MSE): it takes the difference between each observed and predicted value, squares it, and averages these squared differences over the test set (see the sketch just after this list).

  4. K The number of parts the data should be split into. The default, K equal to the number of observations, gives leave-one-out cross-validation.
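As a concrete illustration of item 3, here is a hand-rolled equivalent of that default cost (the numbers below are made up):

## Average squared error, the default cost used by cv.glm()
mse <- function(y, yhat) mean((y - yhat)^2)

mse(c(0, 1, 1), c(0.2, 0.7, 0.9))
# [1] 0.04666667  ## (0.04 + 0.09 + 0.01) / 3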

Judging from your description of the cost function, your predictions (x) would be numbers between 0 and 1 (0-0.5 = 'no' and 0.5-1 = 'yes') and your observations (y) would be 'yes' or 'no'. So the error (e) between observation and prediction would be:

cost <- function(x, y){
  e <- 0
  for (i in 1:length(x)){
    if (x[i] > 0.5) {
      ## predicted 'yes': no error if observed 'yes', otherwise
      ## the error is the distance past the 0.5 cutoff
      err <- if (y[i] == 'yes') 0 else x[i] - 0.5
    } else {
      ## predicted 'no': no error if observed 'no'
      err <- if (y[i] == 'no') 0 else 0.5 - x[i]
    }
    e <- e + err*err  # accumulate the squared error
  }
  e <- e/length(x)    # mean square error
  return (e)
}
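Called on a few made-up values, it behaves as described:

cost(c(0.8, 0.3, 0.6), c('yes', 'yes', 'no'))
# [1] 0.01666667  ## (0^2 + 0.2^2 + 0.1^2) / 3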

Source: http://www.cs.cmu.edu/~schneide/tut5/node42.html