Find the intersection of overlapping ranges in two tables using data.table function foverlaps

@Seth provided the fastest way to intersect overlapping ranges, using the data.table foverlaps function. However, that solution did not account for the fact that the input BED files may themselves contain overlapping ranges that need to be reduced into single regions. @Martin Morgan addressed that with his solution using the GenomicRanges package, which does both the intersecting and the range reducing. However, Martin's solution doesn't use the foverlaps function. @Arun pointed out that finding overlapping ranges in different rows within a single table is not currently possible with foverlaps. Thanks to the answers provided, and some additional research on Stack Overflow, I came up with this hybrid solution.

Create example BED files without overlapping regions within each file.

chr <- c(1:22,"X","Y","MT")

#bedA contains 5 million rows
bedA <- data.table(
    CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
    START = rep(as.integer(seq(1,200000000,1000)),25),
    STOP = rep(as.integer(seq(500,200000000,1000)),25),
    key = c("CHROM","START","STOP")
    )

#bedB contains 500 thousand rows
bedB <- data.table(
  CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
  START = rep(as.integer(seq(200,200000000,10000)),25),
  STOP = rep(as.integer(seq(600,200000000,10000)),25),
  key = c("CHROM","START","STOP")
)

Now create a new bed file containing the intersecting regions in bedA and bedB.

#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))

user  system elapsed 
1.25    0.02    1.37 

#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))

user  system elapsed 
12.95    0.06   13.04 

identical(tmpA,tmpB)
[1] TRUE

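Now, create modified copies of bedA and bedB that contain overlapping regions:

#Create overlapping ranges
#copy() keeps bedA and bedB unchanged, since := modifies a data.table by reference
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- copy(bedA)[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- copy(bedB)[, STOP := STOP + makeOverlaps, by=CHROM]
#Updating a key column removes or shortens the key, so restore the key that foverlaps needs
setkeyv(bedC, c("CHROM","START","STOP"))
setkeyv(bedD, c("CHROM","START","STOP"))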

Time the intersection of BED files with overlapping ranges using either the foverlaps or the GenomicRanges functions.

#This solution uses foverlaps to find the intersection and then runs GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))

user  system elapsed 
1.83    0.05    1.89 

#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))

user  system elapsed 
12.95    0.04   12.99 

identical(tmpC,tmpD)
[1] TRUE

The winner: foverlaps!

FUNCTIONS USED

This function is based on foverlaps and only calls the GenomicRanges-based function (reduceBed.GenomicRanges) if the intersected result contains overlapping ranges, which are detected using the rowShift function.

intersectBedFiles.foverlaps <- function(bed1,bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  if(nrow(bed1)>nrow(bed2)) {
    bed <- foverlaps(bed1, bed2, nomatch = 0)
  } else {
    bed <- foverlaps(bed2, bed1, nomatch = 0)
  }
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = TRUE)) {
    bed <- reduceBed.GenomicRanges(bed)
  }
  return(bed)
}

rowShift <- function(x, shiftLen = 1L) {
  #Note this function was described in this thread:
  #http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}
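
The overlap check above can be tried on a small table. The sketch below is only an illustration: the `toy` table and the `touchesNext` column are made-up names, and it uses data.table's built-in `shift()` (available in more recent data.table versions) in place of `rowShift`, assuming the table is sorted by CHROM and START.

```r
library(data.table)

# Hypothetical toy BED table; the 2nd and 3rd ranges on chromosome 1 overlap
toy <- data.table(
  CHROM = c("1", "1", "1", "2"),
  START = c(1L, 100L, 150L, 1L),
  STOP  = c(50L, 200L, 300L, 50L),
  key = c("CHROM", "START", "STOP")
)

# TRUE where a range overlaps or abuts the next range on the same chromosome;
# the last range of each chromosome has no successor, so it gets NA
toy[, touchesNext := STOP + 1L >= shift(START, type = "lead"), by = CHROM]

# Same decision as in intersectBedFiles.foverlaps: reduce only if needed
hasOverlaps <- any(toy$touchesNext, na.rm = TRUE)
```

Because the rows are sorted by START within each chromosome, comparing each range only with its immediate successor is enough to detect whether any overlap exists.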

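reduceBed.GenomicRanges <- function(bed) {
  require(GenomicRanges)
  #bedKey is defined here because the copy inside intersectBedFiles.foverlaps is not visible in this function
  bedKey <- c("CHROM","START","STOP")
  setnames(bed,colnames(bed),bedKey)
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)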
  grBed <- makeGRangesFromDataFrame(bed,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grBed <- reduce(grBed)
  grBed <- data.table(
    CHROM=as.character(seqnames(grBed)),
    START=start(grBed),
    STOP=end(grBed),
    key = c("CHROM","START","STOP"))
  return(grBed)
}
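
To see what the reduce step does, here is a minimal sketch on a hand-made GRanges object (the object names `gr` and `reduced` are illustrative only). With its default settings, `reduce()` merges ranges that overlap as well as ranges that are directly adjacent, which matches the `STOP+1 >= START` test used above.

```r
library(GenomicRanges)

# Hypothetical ranges on one chromosome:
# [100,200] and [150,300] overlap, and [301,400] is adjacent to [150,300]
gr <- GRanges(
  seqnames = c("1", "1", "1", "1"),
  ranges = IRanges(start = c(1, 100, 150, 301), end = c(50, 200, 300, 400))
)

# Collapse overlapping and adjacent ranges into single regions
reduced <- reduce(gr)

start(reduced)  # 1 100
end(reduced)    # 50 400
```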

This function uses only the GenomicRanges package and produces the same result, but is about 10-fold slower than the foverlaps function.

intersectBedFiles.GR <- function(bed1,bed2) {
  require(data.table)
  require(GenomicRanges)
  bed1 <- makeGRangesFromDataFrame(bed1,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  bed2 <- makeGRangesFromDataFrame(bed2,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grMerge <- suppressWarnings(intersect(bed1,bed2))
  resultTable <- data.table(
    CHROM=as.character(seqnames(grMerge)),
    START=start(grMerge),
    STOP=end(grMerge),
    key = c("CHROM","START","STOP"))
  return(resultTable)
}

An additional comparison using IRanges

I found a solution that collapses overlapping regions using IRanges, but it is more than 10-fold slower than GenomicRanges.

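reduceBed.IRanges <- function(bed) {
  require(IRanges)
  #bedKey is defined locally so the function does not rely on a global variable
  bedKey <- c("CHROM","START","STOP")
  bed.tmp <- bed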
  bed.tmp[,group := { 
      ir <-  IRanges(START, STOP);
      subjectHits(findOverlaps(ir, reduce(ir)))
    }, by=CHROM]
  bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM), 
              START=min(START), 
              STOP=max(STOP)),
       by=list(group,CHROM)]
  setkeyv(bed.tmp,bedKey)
  bed[,group := NULL]
  return(bed.tmp[, -(1:2)])
}


system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))

user  system elapsed 
10.86    0.01   10.89 

system.time(bedD.reduced <- reduceBed.IRanges(bedC))

user  system elapsed 
137.12    0.14  137.58 

identical(bedC.reduced,bedD.reduced)
[1] TRUE

foverlaps() will do nicely.

First set the keys for both of the tables:

setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)

Now join them using foverlaps() with nomatch = 0, which drops the rows of table1 that overlap nothing in table2, so only overlapping pairs are returned.

resultTable <- foverlaps(table1, table2, nomatch = 0)

Next choose the appropriate values for START and STOP, and drop the extra columns.

resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]
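
As a quick check of the steps above, here is a small sketch on two made-up tables (the values are hypothetical and chosen only so the result is easy to verify by hand).

```r
library(data.table)

# Hypothetical input tables
table1 <- data.table(CHROMOSOME = c("1", "1"),
                     START = c(1L, 100L), STOP = c(50L, 200L))
table2 <- data.table(CHROMOSOME = c("1", "1"),
                     START = c(40L, 300L), STOP = c(120L, 400L))
setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)

# Overlap join: START/STOP come from table2, i.START/i.STOP from table1
res <- foverlaps(table1, table2, nomatch = 0)

# Intersection = latest start and earliest stop of each overlapping pair
res[, START := pmax(START, i.START)]
res[, STOP := pmin(STOP, i.STOP)]
res[, `:=`(i.START = NULL, i.STOP = NULL)]

# res now holds the ranges [40, 50] and [100, 120] on chromosome 1
```

The range [300, 400] in table2 overlaps nothing in table1, so it does not appear in the result.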

Handling a STOP that overlaps a later START within the same table should be a separate question. It's actually one that I have myself, so maybe I'll ask it and come back here when I have a good answer.


In case you're not stuck on a data.table solution, GenomicRanges

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("GenomicRanges")

gives

> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]        1 [  5,  10]      *
  [2]        1 [ 20,  50]      *
  [3]        1 [100, 110]      *
  [4]        1 [130, 130]      *
  [5]        Y [  5,  50]      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths