Use R to Efficiently Order Randomly Generated Transects

As @chinsoon12 points out hidden in your problem you have an (Asymmetric) Traveling Salesman Problem. The asymmetry arises because the start and end points of your transecs are different.

ATSP is a renowned NP-complete problem. So exact solutions are very difficult even for medium sized problems (see wikipedia for more info). Hence the best we can do in most cases is approximations or heuristics. As you mention there are thousands of transects this is at least a medium sized problem.

Rather than code an ATSP approximation algorithm from the start, there is an existing TSP library for R. This includes several approximation algorithms. Reference documentation is here.

The follow is my use of the TSP package applied to your problem. Beginning with setup (assume I have run StPt, StID, EndPt, and EndID as in your question.

install.packages("TSP")
library(TSP)
library(dplyr)

# Dataframe
df <- cbind.data.frame(StPt, StID, EndPt, EndID)
# filter to 6 example nodes for requested comparison
df = df %>% filter(StID %in% c(1,3,4,5,8,10))

We shall use ATSP from a distance matrix. Position [row,col] in the matrix is the cost/distance of going from (the end of) transect row to (the start of) transect col. This code creates the entire distance matrix.

# distance calculation
transec_distance = function(end,start){
  abs_dist = abs(start-end)
  ifelse(360-abs_dist > 180, abs_dist, 360-abs_dist)
}

# distance matrix
matrix_distance = matrix(data = NA, nrow = nrow(df), ncol = nrow(df))

for(start_id in 1:nrow(df)){
  start_point = df[start_id,'StPt']

  for(end_id in 1:nrow(df)){
    end_point = df[end_id,'EndPt']
    matrix_distance[end_id,start_id] = transec_distance(end_point, start_point)
  }
}

Note that there are more effective ways to construct a distance matrix. However, I have chosen this approach for its clarity. Depending on your computer and the exact number of transects this code may run very slowly.

Also, note that the size of this matrix is quadratic to the number of transects. So for a large number of transects, you will discover there is not enough memory.

The solving is very unexciting. The distance matrix gets turned into a ATSP object, and the ATSP object gets passed to the solver. We then proceed to add the ordering/traveling information to the original df.

answer = solve_TSP(as.ATSP(matrix_distance))
# get length of cycle
print(answer)

# sort df to same order as solution
df_w_answer = df[as.numeric(answer),]
# add info about next transect to each transect
df_w_answer = df_w_answer %>%
  mutate(visit_order = 1:nrow(df_w_answer)) %>%
  mutate(next_StID = lead(StID, order_by = visit_order),
         next_StPt = lead(StPt, order_by = visit_order))
# add info about next transect to each transect (for final transect)
df_w_answer[df_w_answer$visit_order == nrow(df_w_answer),'next_StID'] =
  df_w_answer[df_w_answer$visit_order == 1,'StID']
df_w_answer[df_w_answer$visit_order == nrow(df_w_answer),'next_StPt'] =
  df_w_answer[df_w_answer$visit_order == 1,'StPt']
# compute distance between end of each transect and start of next
df_w_answer = df_w_answer %>% mutate(dist_between = transec_distance(EndPt, next_StPt))

At this point we have a cycle. You can pick any node as the starting point, follow the order given in the df: from EndID to next_StID, and you will cover every transect in (a good approximation to) the minimum distance.

However in your 'intended outcome' you have a path solution (e.g. start at transect 1 and finish at transect 10). We can turn the cycle into a path by excluding the single most expensive transition:

# as path (without returning to start)
min_distance = sum(df_w_answer$dist_between) - max(df_w_answer$dist_between)
path_start = df_w_answer[df_w_answer$dist_between == max(df_w_answer$dist_between), 'next_StID']
path_end = df_w_answer[df_w_answer$dist_between == max(df_w_answer$dist_between), 'EndID']

print(sprintf("minimum cost path = %.2f, starting at node %d, ending at node %d",
              min_distance, path_start, path_end))

Running all the above gives me a different, but superior, answer to your intended outcome. I get the following order: 1 --> 5 --> 8 --> 4 --> 3 --> 10 --> 1.

  • You path from transect 1 to transect 10 has a total distance of 428, if we also returned from transect 10 to transect 1, making this a cycle, the total distance would be 483.
  • Using the TSP package in R we get a path from 1 to 10 with total distance 377, and as a cycle 431.
  • If we instead start at node 4 and end at node 8, we get a total distance of 277.

Some additional nodes:

  • Not all TSP solvers are deterministic, hence you may get some variation in your answer if you run again, or run with the input rows in a different order.
  • TSP is a much more general problem that the transect problem you described. It is possible that your problem has enough additional/special features that means it can be solved perfectly in a reasonable length of time. But this moves your problem into the realm of mathematics.
  • If you are running out of memory to create the distance matrix, take a look at the documentation for the TSP package. It contains several examples that use geo-coordinates as inputs rather than a distance matrix. This is a much smaller input size (presumably the package calculates the distances on the fly) so if you convert the start and end points to coordinates and specify euclidean (or some other common distance function) you could get around (some) computer memory limits.