disk.frame supports data.table syntax

library(disk.frame)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: purrr
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> 
#> ## Message from disk.frame:
#> We have 1 workers to use with disk.frame.
#> To change that, use setup_disk.frame(workers = n) or just setup_disk.frame() to use the defaults.
#> 
#> 
#> It is recommended that you run the following immediately to set up disk.frame with multiple workers in order to parallelize your operations:
#> 
#> 
#> ```r
#> # this will set up disk.frame with multiple workers
#> setup_disk.frame()
#> # this will allow unlimited amount of data to be passed from worker to worker
#> options(future.globals.maxSize = Inf)
#> ```
#> 
#> Attaching package: 'disk.frame'
#> The following objects are masked from 'package:purrr':
#> 
#>     imap, imap_dfr, map, map_dfr, map2
#> The following objects are masked from 'package:base':
#> 
#>     colnames, ncol, nrow

# set-up disk.frame to use multiple workers
if(interactive()) {
  setup_disk.frame()
  # highly recommended, however it is put inside interactive() for CRAN because
  # changing user options is not allowed on CRAN
  options(future.globals.maxSize = Inf)  
} else {
  setup_disk.frame(2)
}
#> The number of workers available for disk.frame is 2


library(nycflights13)

# create a disk.frame
flights.df = as.disk.frame(nycflights13::flights, outdir = file.path(tempdir(),"flights13"), overwrite = TRUE)

In the following example, I will use .N from the data.table package to count the rows for each combination of year and month within each chunk.

library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following object is masked from 'package:purrr':
#> 
#>     transpose
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
library(disk.frame)

flights.df = disk.frame(file.path(tempdir(),"flights13"))

names(flights.df)
#>  [1] "year"           "month"          "day"            "dep_time"      
#>  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
#>  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
#> [13] "origin"         "dest"           "air_time"       "distance"      
#> [17] "hour"           "minute"         "time_hour"

flights.df[,.N, .(year, month), keep = c("year", "month")]
#>     year month     N
#>  1: 2013     1 27004
#>  2: 2013    10 28889
#>  3: 2013    11   237
#>  4: 2013    11 27031
#>  5: 2013    12 28135
#>  6: 2013     2   964
#>  7: 2013     2 23987
#>  8: 2013     3 28834
#>  9: 2013     4  3309
#> 10: 2013     4 25021
#> 11: 2013     5 28796
#> 12: 2013     6  2313
#> 13: 2013     6 25930
#> 14: 2013     7 29425
#> 15: 2013     8   775
#> 16: 2013     8 28552
#> 17: 2013     9 27574

All data.table syntax is supported. In addition, disk.frame can load only the columns required for the analysis via the keep = option. The analysis above needs only the year and month variables, hence keep = c("year", "month") was used.

Alternatively, we can use the srckeep function to achieve the same result, e.g.

srckeep(flights.df, c("year", "month"))[,.N, .(year, month)]
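
Because the expression is evaluated chunk by chunk, the counts shown earlier contain more than one row for some months (month 11 appears twice, for example). To get one total per month, the partial counts need a second aggregation step. The sketch below illustrates the idea with plain data.table only: the toy table and the two row ranges standing in for chunks are hypothetical, and disk.frame's own chunk handling is not shown.

```r
library(data.table)

# A tiny stand-in for the flights data (hypothetical values, illustration only)
toy <- data.table(
  year  = 2013L,
  month = c(1L, 1L, 2L, 2L, 2L, 3L)
)

# Pretend the table is stored as two chunks of three rows each
chunk_rows <- list(1:3, 4:6)

# Step 1: evaluate the data.table expression on each chunk, then row-bind.
# Month 2 spans both chunks, so it appears twice in chunk_counts.
chunk_counts <- rbindlist(
  lapply(chunk_rows, function(idx) toy[idx, .N, .(year, month)])
)

# Step 2: sum the partial counts to get one row per (year, month)
totals <- chunk_counts[, .(N = sum(N)), by = .(year, month)]
```

Applied to the chunk-wise result above, the same second step would add the two rows for month 11 (237 and 27031) into a single row.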

External variables are captured

disk.frame sends the computation to background workers, which are essentially separate R sessions. Typically, variables available in your current R session aren't visible in other R sessions. However, disk.frame uses the future package's variable detection to work out which variables are in use and sends them to the background workers so that they have access to those variables as well. For example:

y = 42 
some_fn <- function(x) x


flights.df[,some_fn(y)]
#> [1] 42 42 42 42 42 42

In the above example, neither some_fn nor y is defined in the background workers' environments, but disk.frame still manages to evaluate the code flights.df[,some_fn(y)].
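
The same mechanism can be observed with the future package alone. The following is a minimal sketch, not disk.frame's internal code: it assumes two multisession workers and uses future() and value() directly to show that variables referenced in an expression are detected and shipped to a separate R session.

```r
library(future)

# Run futures in two background R sessions (worker count is an assumption)
plan(multisession, workers = 2)

y <- 42
some_fn <- function(x) x

# future() inspects the expression, finds that it refers to `some_fn` and `y`,
# and exports both to the worker before evaluating it there
f <- future(some_fn(y))
result <- value(f)

# Return to sequential evaluation, which also shuts down the workers
plan(sequential)
```

Here result holds the value computed in the worker session, even though neither y nor some_fn was ever defined there by hand.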