Find nearest points of latitude and longitude from different data sets with different length

Here is another possible solution:

library(rgeos)
set1sp <- SpatialPoints(set1)
set2sp <- SpatialPoints(set2)
set1$nearest_in_set2 <- apply(gDistance(set1sp, set2sp, byid=TRUE), 1, which.min)

head(set1)
##        lon      lat nearest_in_set2
## 1 13.67111 48.39167              10
## 2 12.86695 48.14806              10
## 3 15.94223 48.72111              10
## 4 11.09974 47.18917               1
## 5 12.95834 47.05444               1
## 6 14.20389 47.12917               1

You can use a series of apply calls to do this. Note that x and y in the functions below are rows of set2 and set1 respectively, not the lat/lon coordinates; the coordinates of the two points are passed as p1 and p2. (NOTE: edited to correct the order of set1 and set2 in the calculations. The order determines whether you find the point in set2 closest to each point in set1, or vice versa.)

distp1p2 <- function(p1,p2) {
    dst <- sqrt((p1[1]-p2[1])^2+(p1[2]-p2[2])^2)
    return(dst)
}

# For one point y of set1, the smallest distance to any point x of set2
dist2 <- function(y) min(apply(set2, 1, function(x) distp1p2(x, y)))

apply(set1, 1, dist2)

Or, if you want the index of the nearest station rather than the minimum distance, change the outer min to which.min in dist2():

dist2b <- function(y) which.min(apply(set2, 1, function(x) distp1p2(x, y)))
apply(set1, 1, dist2b)

And to get the lat-lon for that station:

set2[apply(set1, 1, dist2b),]
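
To make the set1/set2 ordering concrete, here is a minimal sketch of the same nested-apply idea on made-up data. The coordinates and the helper names dist_to_set2, nearest_idx and nearest_dist are assumptions for illustration only; they are not from the question. It builds the full distance matrix once and reads both the index and the distance from it:

```r
# Hypothetical (lon, lat) coordinates; NOT the data from the question
set1 <- data.frame(lon = c(13.7, 11.1, 14.2),
                   lat = c(48.4, 47.2, 47.1))
set2 <- data.frame(lon = c(11.0, 14.0, 13.5, 16.0),
                   lat = c(47.0, 47.0, 48.5, 48.7))

# Euclidean distance between two (lon, lat) points, as in distp1p2() above
distp1p2 <- function(p1, p2) sqrt((p1[1] - p2[1])^2 + (p1[2] - p2[2])^2)

# For one row y of set1, the distances to every row x of set2
dist_to_set2 <- function(y) apply(set2, 1, function(x) distp1p2(x, y))

# Distance matrix: rows are points in set2, columns are points in set1
d <- apply(set1, 1, dist_to_set2)

nearest_idx  <- apply(d, 2, which.min)  # row of set2 closest to each set1 point
nearest_dist <- apply(d, 2, min)        # the corresponding distance

cbind(set1, nearest_in_set2 = nearest_idx, dist = nearest_dist)
```

With this toy data the nearest rows of set2 are 3, 1 and 2. Swapping set1 and set2 in the two apply calls would instead give, for each point in set2, the nearest point in set1.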

If you have extremely large datasets, a brute-force distance calculation can be cumbersome, because it must compute the distance to every point in the alternative data for each point in the reference data. The 'ann' function from the 'yaImpute' package is a very fast approximate nearest-neighbour routine that is well suited to large datasets. It returns however many "closest" records you ask for (the value of k), as well as the distance to each of them.

Note: despite being an approximate nearest-neighbour search, the results are stable on repeated runs of the same data. It doesn't involve a random selection of points or anything like that. See the package documentation.

FWIW, I'm really not kidding about fast. I've used this to find knn distances for two matrices, each with millions of points. Making a distance matrix for this or doing it iteratively row-by-row is either unfeasible or painfully slow.

Quick example:

# Hypothetical coordinate data
set.seed(2187); foo1 <- round(abs(data.frame(x=runif(5), y=runif(5))*100))
set.seed(2187); foo2 <- round(abs(data.frame(x=runif(10), y=runif(10))*100))
foo1; foo2

# the 'ann' command from the 'yaImpute' package
install.packages("yaImpute")
library(yaImpute)

# Approximate nearest-neighbour search, reporting the 3 nearest points (k=3)
# This command finds the 3 nearest points in foo2 for each point in foo1
# In the output:
#   The first k columns are the row numbers of the nearest points in foo2
#   The next k columns (k+1 to 2k) are the *squared* euclidean distances
knn.out <- ann(as.matrix(foo2), as.matrix(foo1), k=3)
knn.out$knnIndexDist

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    5    4  729 1658 2213
[2,]    2    3    7   16  100 1025
[3,]    9    7    5   40   81  740
[4,]    4    1    6   16  580  673
[5,]    5    7    9    0  677  980
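
As a rough sketch of how this output might be unpacked, the snippet below splits the matrix into its index and distance halves and takes the square root to get plain Euclidean distances. The matrix is typed in by hand from the output shown above so that the snippet runs without yaImpute; the names knn_index_dist, idx, sq_dist and eucl_dist are made up for illustration:

```r
# knnIndexDist as printed above, entered by hand (assumption: in real use
# you would take knn.out$knnIndexDist directly)
k <- 3
knn_index_dist <- matrix(c(1, 5, 4, 729, 1658, 2213,
                           2, 3, 7,  16,  100, 1025,
                           9, 7, 5,  40,   81,  740,
                           4, 1, 6,  16,  580,  673,
                           5, 7, 9,   0,  677,  980),
                         ncol = 2 * k, byrow = TRUE)

idx       <- knn_index_dist[, 1:k, drop = FALSE]              # row numbers in foo2
sq_dist   <- knn_index_dist[, (k + 1):(2 * k), drop = FALSE]  # squared distances
eucl_dist <- sqrt(sq_dist)                                    # Euclidean distances

# Nearest neighbour only: row of foo2 and distance for each row of foo1
data.frame(nearest_in_foo2 = idx[, 1], distance = eucl_dist[, 1])
```

The first column of idx plays the same role as which.min in the brute-force approaches: set2[idx[, 1], ] style indexing would pull out the matching rows.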

https://cran.r-project.org/web/packages/yaImpute/index.html
