How to efficiently sort the characters in a string in R?

Re-implementing using stringi gives a roughly 4x speedup. I also edited sort_cat to use fixed = TRUE in the strsplit, which makes it a little faster. And thanks to Carl for the single-loop suggestion, which speeds things up a little more.

sort_cat <- function(strings){
  # Split each string into single characters (fixed = TRUE skips the regex
  # engine), sort each character vector, then paste each back together.
  tmp <- strsplit(strings, split = "", fixed = TRUE)
  tmp <- lapply(tmp, sort)
  tmp <- lapply(tmp, paste0, collapse = "")
  tmp <- unlist(tmp)
  return(tmp)
}

library(stringi)
sort_stringi = function(s) {
  # Same idea, using stringi's vectorized split, sort, and join.
  s = stri_split_boundaries(s, type = "character")
  s = lapply(s, stri_sort)
  s = lapply(s, stri_join, collapse = "")
  unlist(s)
}

sort_stringi_loop = function(s) {
  # Carl's suggestion: sort and join in a single pass, avoiding the
  # intermediate lists from the two lapply() calls.
  s = stri_split_boundaries(s, type = "character")
  for (i in seq_along(s)) {
    s[[i]] = stri_join(stri_sort(s[[i]]), collapse = "")
  }
  unlist(s)
}
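
The strings vector used in the benchmark comes from the original question and isn't reproduced here. As a hypothetical stand-in for a self-contained run, something like this works:

# Assumed test data, roughly matching the question's scale:
# 500,000 random lowercase strings of length 10.
set.seed(1)
strings <- stri_rand_strings(500000, 10, pattern = "[a-z]")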

bench::mark(
  sort_cat(strings),
  sort_stringi(strings),
  sort_stringi_loop(strings)
)
# # A tibble: 3 x 13
#   expression                    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
#   <bch:expr>                 <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>
# 1 sort_cat(strings)          23.01s 23.01s    0.0435    31.2MB     2.17     1    50     23.01s <chr ~ <Rpro~
# 2 sort_stringi(strings)       6.16s  6.16s    0.162     30.5MB     2.11     1    13      6.16s <chr ~ <Rpro~
# 3 sort_stringi_loop(strings)  5.75s  5.75s    0.174     15.3MB     1.74     1    10      5.75s <chr ~ <Rpro~
# # ... with 2 more variables: time <list>, gc <list>

This method could also be run in parallel; a minimal sketch follows below. If you want to go even faster, profiling the code to see which operations actually take the longest would be a good next step.
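
A minimal sketch of the parallel idea, assuming a Unix-alike where parallel::mclapply can fork (the core count here is an assumption; tune it to your machine):

library(parallel)

sort_stringi_par = function(s, cores = 4L) {
  # Split once, then sort + join each element, spread across cores.
  s = stri_split_boundaries(s, type = "character")
  unlist(mclapply(s, function(x) stri_join(stri_sort(x), collapse = ""),
                  mc.cores = cores))
}

For profiling, base R's Rprof()/summaryRprof() (or the profvis package) will show how the time splits between splitting, sorting, and joining.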


You can certainly reduce time by minimizing the number of loops, and further by using the parallel package. My approach: split the strings once, then sort and paste inside a single loop:

# The original approach, kept as the baseline for comparison:
sort_cat <- function(strings){
    tmp <- strsplit(strings, split = "")
    tmp <- lapply(tmp, sort)
    tmp <- lapply(tmp, paste0, collapse = "")
    tmp <- unlist(tmp)
    return(tmp)
}

library(parallel)  # provides mcMap; mc.cores > 1 needs a Unix-alike
library(stringi)

sort_cat2 <- function(strings){
    # Split once (vectorized), then sort and join each piece across 8 cores.
    unlist(mcMap(function(i){
        stri_join(sort(i), collapse = "")
    }, stri_split_regex(strings, "|", omit_empty = TRUE, simplify = FALSE),
    mc.cores = 8L))
}

> microbenchmark::microbenchmark(
+     old = sort_cat(strings[1:500000]),
+     new = sort_cat2(strings[1:500000]),
+     times = 1
+ )
Unit: seconds
 expr        min         lq       mean     median         uq        max neval
  old 9.62673395 9.62673395 9.62673395 9.62673395 9.62673395 9.62673395     1
  new 5.10547437 5.10547437 5.10547437 5.10547437 5.10547437 5.10547437     1

Shaves off about 4.5 seconds, but it's still not that fast...

Edit

Okay, got it way down using apply. The strategy here:

1) Extract letters rather than splitting on boundaries
2) Create a matrix from the results
3) Iterate through the matrix row-wise
4) Sort each row
5) Join each row back into a string

You avoid multiple loops and unlisting. One caveat: if the strings have different lengths, you'll need to drop any empty or NA entries within the apply, e.g. i[!is.na(i) & nchar(i) > 0] (a variant with that filter is sketched after the function below).

sort_cat3 <- function(strings){
    # Extract every letter into one character matrix (one row per string),
    # then sort and join each row.
    apply(stri_extract_all_regex(strings, "\\p{L}", simplify = TRUE), 1, function(i){
        stri_join(stri_sort(i), collapse = "")
    })
}
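
Following up on the unequal-length caveat: a hypothetical variant with the filter in place (simplify = TRUE pads shorter rows with NA, which the filter removes):

sort_cat3_ragged <- function(strings){
    apply(stri_extract_all_regex(strings, "\\p{L}", simplify = TRUE), 1, function(i){
        i <- i[!is.na(i) & nchar(i) > 0]  # drop NA padding from simplify = TRUE
        stri_join(stri_sort(i), collapse = "")
    })
}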

> microbenchmark::microbenchmark(
+     old = sort_cat(strings[1:500000]),
+     mapping = sort_cat2(strings[1:500000]),
+     applying = sort_cat3(strings[1:500000]),
+     times = 1
+ )
Unit: seconds
     expr         min          lq        mean      median          uq         max neval
      old 10.35101934 10.35101934 10.35101934 10.35101934 10.35101934 10.35101934     1
  mapping  5.12771799  5.12771799  5.12771799  5.12771799  5.12771799  5.12771799     1
 applying  3.97775326  3.97775326  3.97775326  3.97775326  3.97775326  3.97775326     1

Takes us from 10.35 seconds down to 3.98.
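
A quick sanity check of sort_cat3 on a couple of short, equal-length (hypothetical) inputs:

sort_cat3(c("banana", "cherry"))
# [1] "aaabnn" "cehrry"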