R: Group Similar Addresses Together

stringdist::stringsimmatrix lets you compute a pairwise similarity matrix between strings:

library(dplyr)
library(stringr)
df <- data.frame(Address = c("1 Main Street, Country A, World", 
                             "1 Main St, Country A, World", 
                             "3 Main St, Country A, World", 
                             "2 Side Street, Country A, World", 
                             "2 Side St. PO 5678 Country A,  World"))
                             
stringdist::stringsimmatrix(df$Address)
          1         2         3         4         5
1 1.0000000 0.8709677 0.8387097 0.8387097 0.5161290
2 0.8518519 1.0000000 0.9629630 0.6666667 0.4444444
3 0.8148148 0.9629630 1.0000000 0.6666667 0.4444444
4 0.8387097 0.7096774 0.7096774 1.0000000 0.6774194
5 0.5833333 0.5833333 0.5833333 0.7222222 1.0000000

As you pointed out, in the example above, rows 2 and 3 are very similar according to this criterion (96% similarity), even though the house numbers differ.

You could add another criterion by extracting the numbers from each string and comparing those:

# Extract all numbers from each address into a list-column
nums <- df %>% rowwise() %>% mutate(numlist = str_extract_all(Address, "\\(?[0-9]+\\)?"))

# Create all pairs of number vectors
numpairs <- expand.grid(nums$numlist, nums$numlist)

# Jaccard similarity between the two sets of numbers in each pair
numsim <- numpairs %>% rowwise() %>%
  mutate(dist = length(intersect(Var1, Var2)) / length(union(Var1, Var2)))

# Return the similarity matrix
matrix(numsim$dist, nrow(df))

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    0  0.0  0.0
[2,]    1    1    0  0.0  0.0
[3,]    0    0    1  0.0  0.0
[4,]    0    0    0  1.0  0.5
[5,]    0    0    0  0.5  1.0

According to this new criterion, rows 2 and 3 are clearly different.

You could then combine the two criteria to decide whether addresses are similar enough, for example:

matrix(numsim$dist, nrow(df)) * stringdist::stringsimmatrix(df$Address)

          1         2 3         4         5
1 1.0000000 0.8709677 0 0.0000000 0.0000000
2 0.8518519 1.0000000 0 0.0000000 0.0000000
3 0.0000000 0.0000000 1 0.0000000 0.0000000
4 0.0000000 0.0000000 0 1.0000000 0.3387097
5 0.0000000 0.0000000 0 0.3611111 1.0000000
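
To turn the combined similarity matrix into actual groups, one possibility (a sketch, not part of the code above; the 0.5 cut height is an assumption you would tune on your data) is hierarchical clustering on 1 - similarity:

# Cluster on (1 - combined similarity); cutting the tree at height 0.5
# assigns a group id to each address
sim <- matrix(numsim$dist, nrow(df)) * stringdist::stringsimmatrix(df$Address)
hc <- hclust(as.dist(1 - sim))
df$group <- cutree(hc, h = 0.5)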

To deal with many hundreds of thousands of addresses, expand.grid won't work on the whole dataset, but you could split or parallelize the work by country or area (see the sketch below) in order to avoid an infeasible full Cartesian product.
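
For instance, a minimal blocking sketch, assuming the country can be extracted from each address (the "Country ." pattern is an assumption about this toy format):

# Blocking sketch: only compare addresses within the same block (here, country)
df$block <- str_extract(df$Address, "Country .")
sims_by_block <- lapply(split(df$Address, df$block), stringdist::stringsimmatrix)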


You might want to look into OpenRefine, or the refinr package for R, which is much less visual but still good. It has two functions, key_collision_merge and n_gram_merge, which take several parameters. If you have a dictionary of good addresses, you can pass it to key_collision_merge, as sketched below.
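
For instance, a minimal sketch with a made-up dictionary (these entries are illustrative, not real reference data):

library(refinr)

# Hypothetical dictionary of known-good addresses
good_addresses <- c("1 Main Street, Country A, World",
                    "2 Side Street, Country A, World")

# Values whose fingerprint key matches a dictionary entry are replaced by it
key_collision_merge(c("1 main street, country a, world",
                      "2 Side Street Country A World"),
                    dict = good_addresses)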

It's probably a good idea to make a note of the abbreviations you see often (St., Blvd., Rd., etc.) and replace all of them. There are good tables of these abbreviations, e.g. https://www.pb.com/docs/US/pdf/SIS/Mail-Services/USPS-Suffix-Abbreviations.pdf.

Then:

library(dplyr)
library(stringr)
library(refinr)

df <- tibble(Address = c("1 Main Street, Country A, World",
                         "1 Main St, Country A, World",
                         "1 Maine St, Country A, World",
                         "2 Side Street, Country A, World",
                         "2 Side St. Country A, World",
                         "3 Side Rd. Country A, World",
                         "3 Side Road Country B World"))

df2 <- df %>%
  # Expand common abbreviations; note the matched ".", "," or space is
  # consumed along with the abbreviation, e.g. "St," becomes "Street"
  mutate(address_fix = str_replace_all(Address, "St\\.|St\\,|St\\s", "Street"),
         address_fix = str_replace_all(address_fix, "Rd\\.|Rd\\,|Rd\\s", "Road")) %>%
  mutate(address_merge = n_gram_merge(address_fix, numgram = 1))

df2$address_merge
[1] "1 Main Street Country A, World"
[2] "1 Main Street Country A, World"
[3] "1 Main Street Country A, World"
[4] "2 Side Street Country A, World"
[5] "2 Side Street Country A, World"
[6] "3 Side Road Country A, World"  
[7] "3 Side Road Country B World"