data-table-syntax.Rmd
disk.frame supports data.table syntax
library(disk.frame)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#> Loading required package: purrr
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
#>
#> ## Message from disk.frame:
#> We have 1 workers to use with disk.frame.
#> To change that, use setup_disk.frame(workers = n) or just setup_disk.frame() to use the defaults.
#>
#>
#> It is recommended that you run the following immediately to set up disk.frame with multiple workers in order to parallelize your operations:
#>
#>
#> ```r
#> # this will set up disk.frame with multiple workers
#> setup_disk.frame()
#> # this will allow unlimited amount of data to be passed from worker to worker
#> options(future.globals.maxSize = Inf)
#> ```
#>
#> Attaching package: 'disk.frame'
#> The following objects are masked from 'package:purrr':
#>
#> imap, imap_dfr, map, map2
#> The following objects are masked from 'package:base':
#>
#> colnames, ncol, nrow
# set up disk.frame to use multiple workers
if (interactive()) {
  setup_disk.frame()
  # highly recommended; it is put inside interactive() because
  # changing user options is not allowed on CRAN
  options(future.globals.maxSize = Inf)
} else {
  setup_disk.frame(2)
}
#> The number of workers available for disk.frame is 2
library(nycflights13)
# create a disk.frame
flights.df = as.disk.frame(nycflights13::flights, outdir = file.path(tempdir(),"flights13"), overwrite = TRUE)
In the following example, I will use .N from the data.table package to count the rows for each unique combination of year and month within each chunk.
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following object is masked from 'package:purrr':
#>
#> transpose
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
library(disk.frame)
flights.df = disk.frame(file.path(tempdir(),"flights13"))
names(flights.df)
#> [1] "year" "month" "day" "dep_time"
#> [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
#> [9] "arr_delay" "carrier" "flight" "tailnum"
#> [13] "origin" "dest" "air_time" "distance"
#> [17] "hour" "minute" "time_hour"
flights.df[,.N, .(year, month), keep = c("year", "month")]
#> year month N
#> 1: 2013 1 27004
#> 2: 2013 10 28889
#> 3: 2013 11 237
#> 4: 2013 11 27031
#> 5: 2013 12 28135
#> 6: 2013 2 964
#> 7: 2013 2 23987
#> 8: 2013 3 28834
#> 9: 2013 4 3309
#> 10: 2013 4 25021
#> 11: 2013 5 28796
#> 12: 2013 6 2313
#> 13: 2013 6 25930
#> 14: 2013 7 29425
#> 15: 2013 8 775
#> 16: 2013 8 28552
#> 17: 2013 9 27574
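Because .N is evaluated separately within each chunk, a month whose rows are split across two chunks appears twice in the output above (for example, month 11 with counts 237 and 27031). A second, in-memory aggregation can combine these partial counts. The sketch below uses plain data.table and copies four rows from the output above into a stand-in table instead of recomputing them; the names per_chunk and totals are illustrative and not part of disk.frame.

```r
library(data.table)

# Hypothetical stand-in for part of the per-chunk result shown above:
# months 2 and 11 each appear twice because their rows span two chunks.
per_chunk <- data.table(
  year  = 2013L,
  month = c(11L, 11L, 2L, 2L),
  N     = c(237L, 27031L, 964L, 23987L)
)

# Second-stage aggregation: sum the partial counts for each (year, month).
totals <- per_chunk[, .(N = sum(N)), keyby = .(year, month)]
print(totals)
```

With the real data, the same second step can be applied directly to the data.table returned by the flights.df[...] call.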
All data.table syntax is supported. In addition, disk.frame can load only the columns required for the analysis via the keep = option. The analysis above needs only the year and month variables, hence keep = c("year", "month") was used.
Alternatively, we can use the srckeep function to achieve the same result by restricting which columns are read from disk.
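A minimal sketch of the srckeep approach, assuming the same flights data as above. The disk.frame is re-created here, in a hypothetical folder named flights13_srckeep, so that the block runs on its own; the variable name counts is illustrative. Since srckeep already limits the columns that are loaded, no keep = argument is passed to the data.table call.

```r
library(disk.frame)
library(data.table)
library(nycflights13)

setup_disk.frame(2)

# Re-create the disk.frame so this block is self-contained
flights.df <- as.disk.frame(
  nycflights13::flights,
  outdir = file.path(tempdir(), "flights13_srckeep"),
  overwrite = TRUE
)

# Only year and month are loaded from disk before .N is computed per chunk
counts <- srckeep(flights.df, c("year", "month"))[, .N, .(year, month)]
head(counts)
```

As with keep =, the counts are still per chunk, so a month can appear more than once in the result.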
disk.frame sends the computation to background workers, which are essentially distinct and separate R sessions. Typically, the variables available in your current R session aren’t visible in the other R sessions, but disk.frame uses the future package’s variable detection abilities to figure out which variables are in use and then sends them to the background workers so they have access to those variables as well.
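A small sketch of this behaviour, assuming y and some_fn are ordinary objects defined only in the main R session; the value 42 and the identity function are placeholder choices. The disk.frame is re-created in a hypothetical folder named flights13_globals so that the block is self-contained, and the exact shape of the returned value may depend on the disk.frame version.

```r
library(disk.frame)
library(data.table)
library(nycflights13)

setup_disk.frame(2)

# Re-create the disk.frame so this block is self-contained
flights.df <- as.disk.frame(
  nycflights13::flights,
  outdir = file.path(tempdir(), "flights13_globals"),
  overwrite = TRUE
)

# Defined only in the main session, not in the background workers
y <- 42
some_fn <- function(x) x

# y and some_fn are detected as globals and sent to the workers,
# where the expression is evaluated once per chunk
res <- flights.df[, some_fn(y)]
res
```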
In the above example, neither some_fn nor y is defined in the background workers’ environments, but disk.frame still manages to evaluate the code flights.df[,some_fn(y)].