disk.frame supports data.table syntax

library(disk.frame)

# set-up disk.frame to use multiple workers
if(interactive()) {
  setup_disk.frame()
  # highly recommended; it is placed inside interactive() because
  # changing user options is not allowed on CRAN
  options(future.globals.maxSize = Inf)  
} else {
  setup_disk.frame(2)
}


library(nycflights13)

# create a disk.frame
flights.df = as.disk.frame(nycflights13::flights, outdir = file.path(tempdir(),"flights13"), overwrite = TRUE)

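In the following example, I will use .N from the data.table package to count the rows for each unique combination of year and month within each chunk. Because the counts are computed per chunk, a month whose rows are spread over two chunks appears more than once in the output.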

library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
library(disk.frame)

flights.df = disk.frame(file.path(tempdir(),"flights13"))

names(flights.df)
#>  [1] "year"           "month"          "day"            "dep_time"      
#>  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
#>  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
#> [13] "origin"         "dest"           "air_time"       "distance"      
#> [17] "hour"           "minute"         "time_hour"

flights.df[,.N, .(year, month), keep = c("year", "month")]
#> data.table syntax for disk.frame may be moved to a separate package in the future
#>     year month     N
#>  1: 2013     1 27004
#>  2: 2013    10 28889
#>  3: 2013    11   237
#>  4: 2013    11 27031
#>  5: 2013    12 28135
#>  6: 2013     2   964
#>  7: 2013     2 23987
#>  8: 2013     3 28834
#>  9: 2013     4  3309
#> 10: 2013     4 25021
#> 11: 2013     5 28796
#> 12: 2013     6  2313
#> 13: 2013     6 25930
#> 14: 2013     7 29425
#> 15: 2013     8   775
#> 16: 2013     8 28552
#> 17: 2013     9 27574

All data.table syntax is supported. In addition, disk.frame can load only the columns required for the analysis via the keep = option. The analysis above needs only the year and month variables, hence keep = c("year", "month").

Alternatively, we can use the srckeep function to achieve the same result, e.g.

srckeep(flights.df, c("year", "month"))[,.N, .(year, month)]
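
In the output above, some months appear twice (for example, month 11 with counts 237 and 27031) because .N is evaluated within each chunk. The following is a minimal sketch of combining such per-chunk counts into one total per month, assuming the chunked result is already available as an ordinary data.table. The name chunk_counts is a hypothetical stand-in built from four rows of the output above; it is not produced by disk.frame here.

```r
library(data.table)

# Hypothetical stand-in for the per-chunk result: four rows copied from the
# output above, where months 2 and 11 each span two chunks.
chunk_counts <- data.table(
  year  = c(2013L, 2013L, 2013L, 2013L),
  month = c(11L, 11L, 2L, 2L),
  N     = c(237L, 27031L, 964L, 23987L)
)

# Second-stage aggregation: sum the per-chunk counts within each group.
totals <- chunk_counts[, .(N = sum(N)), by = .(year, month)]
print(totals)
```

This two-stage pattern (aggregate within chunks, then aggregate the partial results) works for counts and sums; statistics such as the median cannot be combined this way.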

External variables are captured

disk.frame sends the computation to background workers, which are distinct and separate R sessions. Variables available in your current R session are normally not visible in those other sessions. disk.frame therefore relies on the future package’s variable detection to work out which variables the code uses and sends them to the background workers, so that the workers have access to them as well. E.g.

y = 42 
some_fn <- function(x) x


flights.df[,some_fn(y)]
#> data.table syntax for disk.frame may be moved to a separate package in the future
#> [1] 42 42 42 42 42 42

In the above example, neither some_fn nor y is defined in the background workers’ environments, but disk.frame still manages to evaluate flights.df[,some_fn(y)].
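
The same capture mechanism can be observed with the future package on its own. The sketch below is an illustration under the assumption that the future package is installed and that two multisession workers can be started; it is not disk.frame's internal code.

```r
library(future)

# Run futures in separate background R sessions
# (2 workers is an arbitrary choice for this sketch).
plan(multisession, workers = 2)

y <- 42
some_fn <- function(x) x

# future() inspects the expression, identifies the globals `some_fn` and `y`,
# and exports them to the worker before the expression is evaluated there.
f <- future(some_fn(y))
result <- value(f)
print(result)

# Return to sequential evaluation, which also shuts down the workers.
plan(sequential)
```

Neither some_fn nor y was defined in the worker session beforehand; both were shipped along with the expression.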