Using sampleRandom() from large raster without NA values in R?

It looks like this is documented behavior of the sampleRandom function in the raster package you are using.

If you check the documentation, it states that:

> With argument na.rm=TRUE, the returned sample may be smaller than requested

The question "Random sampling of raster using R?" might provide you with an alternative way to perform this analysis.


Just pad your desired number of random samples and then sample back down to the correct n. This accounts for the occasional NA values that are drawn and then removed by the na.rm=TRUE argument.

    require(raster)
    # Create example data
    r1 <- raster(ncols=500, nrows=500, xmn=0)
      r1[] <- runif(ncell(r1))
    r2 <- raster(ncols=500, nrows=500, xmn=0)
      r2[] <- runif(ncell(r2))  
    r <- stack(r1,r2)

    # Sample size
    n=50

    # Random sample of raster  
    r.samp <- sampleRandom(r, size=(n+20), na.rm=TRUE, sp=FALSE, asRaster=FALSE) 
      dim( r.samp )[1]

    # Create a random sample of n size to subset r.samp
    #   (works with dataframe, matrix and sp objects)
    r.samp <- r.samp[sample( 1:dim(r.samp)[1], n),]
      dim( r.samp )[1]
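
A fixed pad of 20 may not be enough when a large share of the cells are NA. As a rough sketch of the same pad-and-trim idea in base R, the helper below keeps enlarging the pad until enough complete rows survive. The name `sample_complete` and the doubling rule are illustrative assumptions, and a plain matrix stands in for the raster values.

```r
# Hypothetical helper (not part of raster): draw n complete rows from a
# matrix that contains NAs, padding the request and trimming afterwards.
sample_complete <- function(x, n, pad = 20) {
  repeat {
    idx  <- sample(nrow(x), min(nrow(x), n + pad))      # padded draw
    samp <- x[idx, , drop = FALSE]
    samp <- samp[complete.cases(samp), , drop = FALSE]  # mimic na.rm=TRUE
    if (nrow(samp) >= n) break
    if (length(idx) == nrow(x)) stop("fewer than n complete rows available")
    pad <- pad * 2                                      # pad was too small
  }
  samp[sample(nrow(samp), n), , drop = FALSE]           # trim back to n
}

# Example data: 1000 rows, 300 of them with an NA in column "a"
set.seed(42)
x <- cbind(a = runif(1000), b = runif(1000))
x[sample(1000, 300), "a"] <- NA

s <- sample_complete(x, 50)
dim(s)
```

With a raster, the padded draw would be the `sampleRandom()` call and the trim step would stay the same.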

If you can read the raster into memory, an approach in sp would be to use rgdal to create a SpatialGridDataFrame, then coerce it into a SpatialPointsDataFrame so you can easily remove NAs and end up with a point object of your subsample. You can then sample subsequent rasters using this sp point object. The @data dataframe can be extracted and coerced into a matrix for your purposes.

require(sp)
require(rgdal)
require(raster)

n=50 # Number of random samples

# Read raster data using rgdal, results in SpatialGridDataFrame 
r <- readGDAL(system.file("external/test.ag", package="sp")[1])
  class(r)
    spplot(r, "band1")

# Coerce into SpatialPointsDataFrame    
r <- as(r, "SpatialPointsDataFrame")      

# remove NA's (subset whole rows so coordinates and attributes stay aligned)
r <- r[complete.cases(r@data), ]
  plot(r, pch=20)

# Create random sample. Object is a SpatialPointsDataFrame     
r.samp <- r[sample(1:dim(r)[1], n),]
  plot(r.samp, pch=20, col="red", add=TRUE)   
    class(r.samp)

#  Use r.samp sp object for additional sampling 
#    Add extra column and coerce to raster stack
r2 <- readGDAL(system.file("external/test.ag", package="sp")[1])
  r2@data <- data.frame(r2@data, band2=runif(dim(r2)[1]) ) 
    r2 <- stack(r2)

# Extract raster values using r.samp object
r.samp@data <- data.frame(r.samp@data, band2=extract(r2[[2]], r.samp))
  str(r.samp@data)

I was able to reproduce the problem in the third example.

A workaround uses built-in procedures. There are several options, but one convenient method is to select each cell in the grid independently and uniformly at random, with a probability large enough to ensure that at least n=2000 (or however many are needed) non-null cells are selected, but not many more than that. This can be accomplished by computing the standard deviation of the proportion of all cells that will be selected (the number selected has a Binomial distribution) and adding a small multiple of that standard deviation to the desired proportion. A multiple around 6 virtually guarantees that at least n cells will be selected. In the example code below, 2020 cells were selected where 2000 were needed.

This method is a little inefficient compared to repeating the built-in sampleRandom procedure. Unlike the latter, though, this method samples without replacement.


This code continues in the context of Example 3 of the question: it uses the r_mask grid for input and requires the raster library to be loaded in order to use getValues.

set.seed(17)
n.sample <- 2000 # Number of non-null cells to sample
system.time({
  m <- dim(r_mask)[1]
  n <- dim(r_mask)[2]
  k <- sum(!is.na(getValues(r_mask))) # Number of non-null cells
  p <- n.sample / k                   # Proportion of them to be sampled
  pp <- p + 6*sqrt(p*(1-p)/(m*n))     # Proportion to request

  v <- getValues(r_mask)              # Cell values, in raster cell order
  cells <- which(runif(m*n) < pp)     # Cell numbers selected at random
  cells <- cells[!is.na(v[cells])]    # Keep only the non-null cells
  x <- colFromCell(r_mask, cells)     # Column index of each selected cell
  y <- rowFromCell(r_mask, cells)     # Row index of each selected cell
  a <- cbind(x, y, z=v[cells])        # Result (with too many rows)
  print(dim(a))
  a <- a[sample(nrow(a), n.sample), ] # Randomly drop any unneeded rows
})
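
The padding rule in the code above can be checked on its own, without any raster. The sketch below uses base R only. The grid size and the number of non-null cells are made-up values for illustration, not taken from the question's data. It simulates how many non-null cells would be selected in repeated draws.

```r
# Hypothetical numbers for illustration only
set.seed(1)
n.cells  <- 500 * 500   # total number of cells in the grid (m*n)
k        <- 200000      # assumed number of non-null cells
n.sample <- 2000        # number of non-null cells required

p  <- n.sample / k                         # proportion of non-null cells needed
pp <- p + 6 * sqrt(p * (1 - p) / n.cells)  # padded selection probability

# Each non-null cell is selected independently with probability pp, so the
# number selected is Binomial(k, pp). Simulate 1000 such draws.
hits <- rbinom(1000, size = k, prob = pp)

mean(hits >= n.sample)  # fraction of draws that reached the target
range(hits)             # smallest and largest number of cells selected
```

Under these assumed numbers the expected surplus is a few hundred cells, so the final trimming step discards only a small share of the selection.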