How to pipe SQL into R's dplyr?

It looks like you want to combine custom SQL code with the auto-generated SQL code from dbplyr. For this it is important to distinguish between:

  • DBI::db* commands, which execute the provided SQL on the database and return the result.
  • dbplyr translation, where you work with a remote connection to a table.

You can only combine these in certain ways. Below I have given several examples depending on your particular use case. All assume that DISTINCT is a command that is accepted in your specific SQL environment.
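
For example, a minimal sketch of the difference (assuming con is an active DBI connection and the database already contains a table called my_table; both names are placeholders):

library(DBI)
library(dplyr)
library(dbplyr)

# DBI: send a complete SQL string to the database and get a local data frame back
local_df <- dbGetQuery(con, "SELECT DISTINCT col1 FROM my_table")

# dbplyr: build a lazy remote table and let dplyr verbs be translated to SQL
remote_tbl <- tbl(con, "my_table") %>%
  distinct(col1)

show_query(remote_tbl)   # inspect the generated SQL
collect(remote_tbl)      # execute it and bring the result into R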

Reference examples that cover many of the different use cases

If you'll excuse some self-promotion, I recommend you take a look at my dbplyr_helpers GitHub repository (here). This includes:

  • a union_all function that takes two tables accessed via dbplyr and outputs a single table using some custom SQL code (a rough sketch of this idea is given just below).
  • a write_to_database function that takes a table accessed via dbplyr and converts it to code that can be executed via DBI::dbExecute.
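
To give a flavour of the first of these, here is a rough sketch of the idea (not the repository's actual code): render the SQL that defines each dbplyr table and join the two queries with UNION ALL, assuming both tables share the same connection.

union_all_sketch <- function(table_a, table_b){
  # both tables are assumed to come from the same database connection
  db_connection <- table_a$src$con

  sql_query <- build_sql(con = db_connection,
                         sql_render(table_a),
                         "\nUNION ALL\n",
                         sql_render(table_b)
  )
  return(tbl(db_connection, sql(sql_query)))
}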

Automatic piping

dbplyr automatically pipes your code into the next query for you when you are working with standard dplyr verbs for which there are SQL translations defined. As long as SQL translations are defined you can chain together many pipes (I have used 10 or more at once), with the (almost) only disadvantage being that the translated SQL query becomes difficult for a human to read.

For example, consider the following:

library(dbplyr)
library(dplyr)

tmp_df = data.frame(col1 = c(1,2,3), col2 = c("a","b","c"))

df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())

df = left_join(df1, df2, by = "col1") %>%
  distinct()

When you then call show_query(df), R returns the following auto-generated SQL code (though not as nicely formatted):

SELECT DISTINCT *
FROM (

SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)

) `dbplyr_002`

Note that the initial command (the left join) appears as a nested query, with the distinct in the outer query. Hence df is an R link to a remote database table defined by the above SQL query.
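
If you want to execute this query rather than just inspect it, then (with a real database connection rather than the simulate_postgres() connection used above) you can run, for example:

local_df  <- df %>% collect()   # executes the SQL and returns a local data frame
remote_df <- df %>% compute()   # executes the SQL and stores the result as a table on the database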

Creating custom SQL functions

You can pipe dbplyr tables into custom SQL functions. Piping means that the object being piped becomes the first argument of the receiving function, so df %>% custom_distinct() is equivalent to custom_distinct(df).

custom_distinct <- function(df){
  # extract the DBI connection that underlies the dbplyr table
  db_connection <- df$src$con

  # wrap the SQL that defines df inside a custom DISTINCT query
  sql_query <- build_sql(con = db_connection,
                         "SELECT DISTINCT * FROM (\n",
                         sql_render(df),
                         ") AS nested_tbl"
  )
  # return a new remote table defined by the custom SQL
  return(tbl(db_connection, sql(sql_query)))
}

df = left_join(df1, df2, by = "col1") %>%
  custom_distinct()

When you then call show_query(df) R should return the following SQL code (I say 'should' because I cannot get this working with simulated SQL connections), but not as nicely formatted:

SELECT DISTINCT * FROM (

SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)

) AS nested_tbl

As with the previous example, df is an R link to a remote database table defined by the above sql query.

Converting dbplyr to DBI

You can take the SQL from an existing dbplyr remote table and convert it to a string that can be executed using DBI::db* commands.

As another way of writing a distinct query:

df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())

df = left_join(df1, df2, by = "col1")

custom_distinct2 = paste0("SELECT DISTINCT * FROM (",
                          as.character(sql_render(df)),
                          ") AS nested_table")

# db_connection must be a real DBI connection; a simulated connection can not execute queries
local_table = DBI::dbGetQuery(db_connection, custom_distinct2)

This will return a local R data frame produced by the same SQL as in the previous examples.
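
The same approach works with DBI::dbExecute() if you want the database to store the result instead of returning it to R. For example (new_table is just a placeholder name, and the exact CREATE TABLE syntax varies between SQL dialects):

create_table_sql = paste0("CREATE TABLE new_table AS ",
                          as.character(sql_render(df)))

DBI::dbExecute(db_connection, create_table_sql)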


If you want to do custom SQL processing on the result of a dbplyr operation, it may be useful to compute() first, which creates a new table (temporary or permanent) with the result set on the database. The reprex below shows how to access the name of the newly generated table if you rely on autogeneration. (Note that this relies on dbplyr internals and is subject to change without notice; perhaps it's better to name the table explicitly.) Then, use dbGetQuery() as usual.

library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

lazy_query <-
  memdb_frame(a = 1:3) %>%
  mutate(b = a + 1) %>%
  summarize(c = sum(a * b, na.rm = TRUE))

lazy_query
#> # Source:   lazy query [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#>       c
#>   <dbl>
#> 1    20

lazy_query_computed <-
  lazy_query %>%
  compute()

lazy_query_computed
#> # Source:   table<dbplyr_002> [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#>       c
#>   <dbl>
#> 1    20
lazy_query_computed$ops$x
#> <IDENT> dbplyr_002

Created on 2020-01-01 by the reprex package (v0.3.0)
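
If you prefer not to rely on those internals, compute() also accepts an explicit name argument, and you can then refer to that name directly in your custom SQL (my_intermediate_result is just a placeholder):

lazy_query_computed_named <-
  lazy_query %>%
  compute(name = "my_intermediate_result")

DBI::dbGetQuery(
  lazy_query$src$con,
  "SELECT c + 1 AS d FROM my_intermediate_result"
)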

If your SQL dialect supports CTEs, you could also extract the query string and use it as part of a custom SQL query, perhaps similarly to Simon's suggestion above.

library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

lazy_query <-
  memdb_frame(a = 1:3) %>%
  mutate(b = a + 1) %>%
  summarize(c = sum(a * b, na.rm = TRUE))

sql <-
  lazy_query %>%
  sql_render()

cte_sql <-
  paste0(
    "WITH my_result AS (", sql, ") ",
    "SELECT c + 1 AS d FROM my_result"
  )

cte_sql
#> [1] "WITH my_result AS (SELECT SUM(`a` * `b`) AS `c`\nFROM (SELECT `a`, `a` + 1.0 AS `b`\nFROM `dbplyr_001`)) SELECT c + 1 AS d FROM my_result"

DBI::dbGetQuery(
  lazy_query$src$con,
  cte_sql
)
#>    d
#> 1 21

Created on 2020-01-01 by the reprex package (v0.3.0)