R: Detect a "main" Path and remove or filter the GPS trace maybe using a kernel?

Edits below one for a more correct & complete answer, the other for a faster one.

This solution works for this case, but I'm not sure it will work in cases that aren't similarly shaped. There are a few parameters that can be adjusted that might find better results. It relies heavily on the sf package and classes.

The code below will:

  • Start with all the points as an sf object
  • Connect each to (an adjustable) number of its nearest neighbors
  • Remove the connections that are too far off the path
  • Create a network
  • Find the shortest path (which will have too few points from the original data)
  • Get the missing points back
libary(sf)
library(tidyverse) ## <- heavy, but it's easy to load the whole thing
library(tidygraph) ##  I'm not sure this was needed
library(nngeo)
library(sfnetworks) ## https://github.com/luukvdmeer/sfnetworks


path_sf <- st_as_sf(path, coords = c('lon', 'lat')

# create a buffer around a connected line of our points.
#  used later to filter out unwanted edges of graph
path_buffer <- 
  path_sf %>%
   st_combine() %>%
   st_cast('MULTILINESTRING') %>%
   st_buffer(dist = .001)         ## dist = arg will depend on projection CRS.


# Connect each point to its 20 nearest neighbors,
#  probably overkill, but it works here.  Problems occur when points on the path
#  have very uneven spacing. A workaround would be to st_sample a linestring of the path
connected20 <- st_connect(path_sf, path_sf,
                          ids = st_nn(path_sf, path_sf, k = 20))

What we have so far:

ggplot() + 
  geom_sf(data = path_sf) + 
  geom_sf(data = path_buffer, color = 'green', alpha = .1) +
  geom_sf(data = connected20, alpha = .1)

Points, Buffer & Connections

Now we need to get rid of the connections outside path_buffer.

# Remove unwanted edges outside the buffer
edges <- connected20[st_within(connected20,
                               path_buffer,
                               sparse = F),] %>%
  st_as_sf()

ggplot(edges) + geom_sf(alpha = .2) + theme_void()

Clipped edges

## Create a network from the edges
net <- as_sfnetwork(edges, directed = T) ########## directed?

## Use network to find shortest path from the first point to the last. 
## This will exclude some original points,
##  we'll get them back soon.
shortest_path <- st_shortest_paths(net,
                                   path_sf[1,],
                                   path_sf[nrow(path_sf),])

# Probably close to the shortest path, the turn looks long
short_ish <- path_sf[shortest_path$vpath[[1]],] 

The plot of short_ish shows that some points are probably missing:

enter image description here

# Use this to regain all points on the shortest path
short_buffer <- short_ish %>%
  st_combine() %>%
  st_cast('LINESTRING') %>%
  st_buffer(dist = .001)

short_all <- path_sf[st_within(path_sf, short_buffer, sparse = F), ]

Almost all the points on (what may be) the shortest path:

enter image description here

Adjusting buffer distances dist, and number of nearest neighbors k = 20 might give you a better result. For some reason this misses a couple of points just south of the fork, and might travel too far east at the fork. The nearest neighbors function can also return distances. Removing connections longer than the greatest distance between neighboring points would help.

Edit:

Code below should get a better track after running code above. Image includes original track, shortest path, all points along the shortest track, and the buffer to obtain those points. Start point in green, end point in red.

## Path buffer as above, dist = .002 instead of .001
path_buffer <- 
  path_sf %>%
  st_combine() %>%
  st_cast('MULTILINESTRING') %>%
  st_buffer(dist = .002)        

### Starting point, 1st point of data
p1 <- net %>% activate('nodes') %>%
  st_as_sf() %>% slice(1)

### Ending point, last point of data
p2 <- net %>% activate('nodes') %>%
  st_as_sf() %>% tail(1)

# New short path
shortest_path2 <- net %>% 
  convert(to_spatial_shortest_paths, p1, p2)
# Buffer again to get all points from original
shortest_path_buffer <- shortest_path2 %>%
  activate(edges) %>% 
  st_as_sf() %>% 
  st_cast('MULTILINESTRING') %>%
  st_combine() %>%
  st_buffer(dist = .0018)

# Shortest path, using all points from original data
all_points_short_path <- path_sf[st_within(path_sf, shortest_path_buffer, sparse = F),]

# Plotting
ggplot() + 
  geom_sf(data = p1, size = 4, color = 'green') + 
  geom_sf(data = p2, size = 4, color = 'red') + 
  geom_sf(data = path_sf, color = 'black', alpha = .2) + 
  geom_sf(data = activate(shortest_path2, 'edges') %>% st_as_sf(), color = 'orange', alhpa = .4) + 
  geom_sf(data = shortest_path_buffer, fill = 'blue', alpha = .2) + 
  geom_sf(data = all_points_short_path, color = 'orange', alpha = .4) +
  theme_void()

Final plot, all included

Edit 2 Probably faster, though hard to tell how much with a small dataset. Also, less likely to include all correct points. Misses a few points from original data.

path_sf <- st_as_sf(path, coords = c('lon', 'lat'))


# Higher density is slower, but more complete. 
# Higher k will be fooled by winding paths as incorrect edges aren't buffered out
# in the interest of speed.
density = 200
k = 4
  
start <- path_sf[1, ] %>% st_geometry()
end <- path_sf[dim(path_sf)[1],] %>% st_geometry()

path_sf_samp <- path_sf %>%
  st_combine() %>%
  st_cast('LINESTRING') %>%
  st_line_sample(density = density) %>%
  st_cast('POINT') %>%
  st_union(start) %>%
  st_union(end) %>%
  st_cast('POINT')%>%
  st_as_sf()

connected3 <- st_connect(path_sf_samp, path_sf_samp,
                          ids = st_nn(path_sf_samp, path_sf_samp, k = k))

edges <- connected3 %>%
  st_as_sf()

net <- as_sfnetwork(edges, directed = F) ########## directed?

shortest_path <- net %>% 
  convert(to_spatial_shortest_paths, start, end)

shortest_path_buffer <- shortest_path %>%
  activate(edges) %>% 
  st_as_sf() %>% 
  st_cast('MULTILINESTRING') %>%
  st_combine() %>%
  st_buffer(dist = .0018)

all_points_short_path <- path_sf[st_within(path_sf, shortest_path_buffer, sparse = F),]


ggplot() + 
  geom_sf(data = path_sf, color = 'black', alpha = .2) + 
  geom_sf(data = activate(shortest_path, 'edges') %>% st_as_sf(), color = 'orange', alpha = .4) + 
  geom_sf(data = shortest_path_buffer, fill = 'blue', alpha = .2) + 
  geom_sf(data = all_points_short_path, color = 'orange', alpha = .4) +
  theme_void()

enter image description here


I will make an attempt to answer this question. Here I am using a naive algorithm. Hopefully, other people can propose solutions better than this one.

I guess we can assume that the starting and ending points of your GPS trace are always on the so-called "main path". If this assumption is valid, then we can draw a line between these two points and use that as the reference. Call this the reference line.

The algorithm is:

  1. For each point i of that trace, calculate the distance from the point to the reference line. Call this distance di.
  2. Tabulate the empirical distribution of all di s and select only those points with di below a specific quantile of that distribution. Call this quantile the threshold. Using a higher threshold is logically equivalent to selecting more points.
  3. The main path is, therefore, the route defined by those selected points.

To calculate di, I use the following formula from this Wikipedia webpage:

formula

The code is

distan <- function(lon, lat) {
  x1 <- lon[[1L]]; y1 <- lat[[1L]]
  x2 <- tail(lon, 1L); y2 <- tail(lat, 1L)
  dy <- y2 - y1; dx <- x2 - x1
  abs(dy * lon - dx * lat + x2 * y1 - y2 * x1) / sqrt(dy * dy + dx * dx)
}

path_filter <- function(lon, lat, threshold = 0.6) {
  d <- distan(lon, lat)
  th <- quantile(d, threshold, na.rm = TRUE)
  d <= th
}

The path_filter function returns a logical vector of the same length as the input vector(s), so you can use it like this (assume that path is a data.table):

path[path_filter(lon, lat, 0.6), ]

Now let's see the resultant main paths for different thresholds. I use the following code to plot figures for thresholds 0, 0.1, 0.2, ..., 1.

library(rnaturalearth)
library(ggplot2)
library(dplyr)
library(tidyr)

map <- ne_countries(scale = "small", returnclass = "sf")

df <- 
  path %>% 
  expand(threshold = 0:10 / 10, nesting(counter, lon, lat)) %>% 
  group_by(threshold) %>% 
  filter(path_filter(lon, lat, threshold)) %>% 
  mutate(threshold = paste0("threshold = ", threshold))

ggplot(map) + 
  geom_sf() + 
  geom_point(aes(x = lon, y = lat, group = threshold), size = 0.01, data = df) + 
  coord_sf(xlim = range(df$lon), ylim = range(df$lat)) + 
  facet_wrap(vars(threshold), ncol = 4L) + 
  theme(axis.text.x = element_text(angle = 90, vjust = .5))

The plots are:

enter image description here

Indeed, a higher threshold gives you more points. For your specific case, I guess you would like to use a threshold of about 0.6?


Okay, I've thought about bearings and difference in bearings a bit and have created an approach which simply considers the angle between the bearing of line (i, i+1) and the bearing of line (i+1, i+2). If the angle between these two bearings is greater than some threshold, we delete points i and i+1.

filtered track

library(tidyverse)
library(geosphere)

## This function calculates the difference between two bearings
angle_diff <- function(theta1, theta2){
 theta <- abs(theta1 - theta2) %% 360 
 return(ifelse(theta > 180, 360 - theta, theta))
}

## This function removes points (i, i + 1) if the bearing difference 
## between (i, i+1) and (i+1, i+2) is larger than angle 
filter_function <- function(data, angle){
 data %>% ungroup() %>% 
  (function(X)X %>% 
    slice(-(X %>% 
             filter(bearing_diff > angle)  %>%
             select(counter, counter_2) %>%
             unlist()))) 
}


## This function calculates the bearing of the line (i, i+1)
## It also handles the iteration needed in the while-loop
calc_bearing <- function(data, lead_counter = TRUE){
 data %>% 
  mutate(counter = 1:n(),
         lat2 = lead(lat), 
         lon2 = lead(lon),
         counter_2 = lead(counter)) %>%
  rowwise() %>% 
  mutate(bearing = geosphere::bearing(p1 = c(lat, lon),
                                      p2 = c(lat2, lon2))) %>% 
  ungroup() %>%
  mutate(bearing_diff = angle_diff(bearing, lead(bearing)))
}

## this is our max angle
max_angle = 100

## Here is our while loop which cycles though the path,
## removing pairs of points (i, i+1) which have "inconsistent" 
## bearings. 
filtered <- 
 path %>%
 as_tibble() %>% 
 calc_bearing() %>%
 (function(X){
  while(any(X$bearing_diff > max_angle) &
        !is.na(any(X$bearing_diff > max_angle))){
   X <- X %>% 
    filter_function(angle = max_angle) %>%
    calc_bearing()
  }
  X
 })

## Here we plot the new track
ggplot(filtered, aes(lon, lat)) +
 geom_point() +
 coord_map()