Unsupervised classification with kmeans in R

I don't think you can, at least not directly. You first have to label each class before you can compare them. K-means classifies without supervision, that is, without any prior information, so it cannot attach a meaning to the clusters it produces.

If you have a reference layer, you can label the clusters by majority voting. Here is some code for majority voting that is considerably more efficient than the zonal function from the 'raster' package:

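library(raster)      # provides getValues() and modal()
library(data.table)
vals <- getValues(ref)                             # reference class of each cell
zones <- round(getValues(class_file), digits = 0)  # k-means cluster id of each cell
rDT <- data.table(vals, z = zones)
setkey(rDT, z)
# most frequent reference class within each cluster
zr <- rDT[, lapply(.SD, modal, na.rm = TRUE), by = z]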

where ref is your reference class raster and class_file is your k-means result.

zr gives the 'zone' (cluster) number in its first column and the label assigned to that cluster in its second column.
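
To make the voting step concrete, here is a minimal base-R sketch of the same idea on toy vectors. The values of vals and zones are made up for illustration (in the real workflow they come from getValues), and majority is a hypothetical helper standing in for modal, so treat this as an illustration rather than a replacement for the code above.

```r
# Toy stand-ins (assumptions): in the real workflow these come from
# getValues(ref) and getValues(class_file)
vals  <- c(10, 10, 20, 20, 20, 30, NA, 30)  # reference class per cell
zones <- c( 1,  1,  1,  2,  2,  3,  3,  3)  # k-means cluster per cell

# Hypothetical helper: most frequent non-NA value
# (ties go to the smallest value)
majority <- function(x) {
  counts <- table(x)  # table() drops NA by default
  as.numeric(names(counts)[which.max(counts)])
}

# One label per cluster, named by cluster id
labels <- tapply(vals, zones, majority)

# Relabel every cell with the majority class of its cluster
labelled <- unname(labels[as.character(zones)])
```

Here labels holds one reference class per cluster, and labelled carries that class back to every cell, which is what you would write into the output raster.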


To implement clustering on an image stack, you do not do it band-by-band but rather on the entire image stack simultaneously. Otherwise, as pointed out by @nmatton, the statistic does not make much sense.

However, I do not agree that this is impossible; it is just memory intensive. On real satellite data this will be a huge problem, and perhaps insurmountable on high-resolution data, but you can process in memory by coercing your raster(s) into a single object that can be passed to a clustering function. You will need to track NA values across rasters, because they are removed during clustering and you need to know their positions in the raster in order to assign the cluster values to the correct cells.

We can step through one approach here. Let's load the required libraries and some example data (the RGB R logo, which gives us 3 bands to work with).

library(raster)
library(cluster)
r <- stack(system.file("external/rlogo.grd", package="raster")) 
plot(r)

First, we can coerce our multi-band raster stack object to a matrix (one row per cell, one column per band) using getValues. Note that I am adding an NA value at row 1, column 3 so I can illustrate how to deal with no data.

r.vals <- getValues(r[[1:3]])
r.vals[1, 3] <- NA

Here, we can get down to business and create a cell index of the non-NA values that will be used to assign the cluster results.

# cells with no NA in any band; unlike negative indexing on which(),
# this also works when there are no NA values at all
idx <- which(complete.cases(r.vals))

Now, we create a cluster object from the 3-band RGB values with k=4. I am using the clara K-Medoids method because it scales to large data (it fits the medoids on subsamples) and, since medoids are actual observations, it copes better with odd distributions and outliers. It is otherwise very similar to K-Means.

clus <- cluster::clara(na.omit(scale(r.vals)), k=4)

For simplicity's sake, we can create an empty raster by pulling one of the raster bands from our original raster stack object and assigning it NA values.

r.clust <- r[[1]]
r.clust[] <- NA

Finally, using the index, we assign the cluster values to the appropriate cell in the empty raster and plot the results.

r.clust[idx] <- clus$clustering
plot(r.clust) 
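
If you want k-means proper rather than K-Medoids, stats::kmeans could be swapped in at the clustering step. The following is a minimal sketch on a toy matrix that stands in for the cell values (the sizes, the seed and nstart = 5 are assumptions chosen for illustration), so it runs without the raster package:

```r
set.seed(42)

# Toy stand-in for the cell-value matrix: 100 cells x 3 bands
vals <- matrix(runif(300, 0, 255), ncol = 3)
vals[1, 3] <- NA  # one no-data cell, as in the walk-through

# Index of cells that are complete across all bands
idx <- which(complete.cases(vals))

# k-means on the complete cells only; several random starts make the
# result less dependent on the initial centres
km <- kmeans(scale(vals[idx, ]), centers = 4, nstart = 5)

# Write cluster ids back to their cell positions; no-data cells stay NA
out <- rep(NA_integer_, nrow(vals))
out[idx] <- km$cluster
```

With the real data, r.clust[idx] <- km$cluster would take the place of the clara assignment. Note that kmeans, unlike clara, uses every cell passed to it, so memory use grows with the size of the image.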

For huge rasters you may want to look into the bigmemory package, which stores matrices on disk and operates on them in blocks; a k-means function is available for it (bigkmeans in the companion biganalytics package). Also, keep in mind that this is not exactly what R was designed for and that image processing or GIS software may be more appropriate. I know that SAGA and the Orfeo ToolBox are both free software with k-means clustering available for image stacks. There is even an RSAGA package that allows SAGA to be called from R.
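
To illustrate the block-wise idea without any extra packages, here is a rough base-R sketch. It is not the bigmemory approach: it fits k-means on a random sample of cells and then assigns all cells to their nearest centre one block of rows at a time. The toy matrix, the sample size, block_size and the helper nearest_centre are all assumptions made up for this example.

```r
set.seed(1)

# Toy stand-in for the cell values of a large 3-band stack (rows = cells)
vals <- matrix(runif(3000, 0, 255), ncol = 3)

# 1. Fit k-means on a random sample of cells only
samp <- sample(nrow(vals), 200)
km <- kmeans(vals[samp, ], centers = 4, nstart = 5)

# 2. Hypothetical helper: index of the nearest centre for each row of a block
nearest_centre <- function(block, centres) {
  # squared Euclidean distance from every row to every centre
  d <- vapply(seq_len(nrow(centres)), function(j) {
    rowSums(sweep(block, 2, centres[j, ])^2)
  }, numeric(nrow(block)))
  dim(d) <- c(nrow(block), nrow(centres))  # keep a matrix for 1-row blocks
  max.col(-d, ties.method = "first")       # smallest distance per row
}

# 3. Label all cells block by block, so only one block is processed at a time
block_size <- 250  # hypothetical; pick whatever fits in memory
labels <- integer(nrow(vals))
for (s in seq(1, nrow(vals), by = block_size)) {
  rows <- s:min(s + block_size - 1, nrow(vals))
  labels[rows] <- nearest_centre(vals[rows, , drop = FALSE], km$centers)
}
```

In this toy version the whole matrix is still in memory; with a real stack, each block would instead be read from disk inside the loop, for example a few raster rows at a time, and the labels written back to the output raster for those rows.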