vignettes/05-data-table-syntax.Rmd
disk.frame supports data.table syntax
library(disk.frame)
# set-up disk.frame to use multiple workers
if (interactive()) {
  setup_disk.frame()
  # highly recommended; however, it is put inside interactive() because
  # changing user options is not allowed on CRAN
  options(future.globals.maxSize = Inf)
} else {
  setup_disk.frame(2)
}
library(nycflights13)
# create a disk.frame
flights.df = as.disk.frame(nycflights13::flights, outdir = file.path(tempdir(),"flights13"), overwrite = TRUE)
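In the following example, I use .N from the data.table package to count the rows for each unique combination of year and month within each chunk. Because the counts are computed per chunk, a combination whose rows span two chunks appears twice in the output.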
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
library(disk.frame)
flights.df = disk.frame(file.path(tempdir(),"flights13"))
names(flights.df)
#> [1] "year" "month" "day" "dep_time"
#> [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
#> [9] "arr_delay" "carrier" "flight" "tailnum"
#> [13] "origin" "dest" "air_time" "distance"
#> [17] "hour" "minute" "time_hour"
flights.df[,.N, .(year, month), keep = c("year", "month")]
#> data.table syntax for disk.frame may be moved to a separate package in the future
#> year month N
#> 1: 2013 1 27004
#> 2: 2013 10 28889
#> 3: 2013 11 237
#> 4: 2013 11 27031
#> 5: 2013 12 28135
#> 6: 2013 2 964
#> 7: 2013 2 23987
#> 8: 2013 3 28834
#> 9: 2013 4 3309
#> 10: 2013 4 25021
#> 11: 2013 5 28796
#> 12: 2013 6 2313
#> 13: 2013 6 25930
#> 14: 2013 7 29425
#> 15: 2013 8 775
#> 16: 2013 8 28552
#> 17: 2013 9 27574
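Since .N is evaluated within each chunk, some year and month pairs appear more than once above. The following is a minimal sketch, using plain data.table and four rows copied from that output, of how a second aggregation step could combine the per-chunk counts. The names per_chunk and totals are illustrative only and are not part of disk.frame.

```r
library(data.table)

# Per-chunk counts copied from four rows of the output above
per_chunk <- data.table(
  year  = c(2013L, 2013L, 2013L, 2013L),
  month = c(11L, 11L, 2L, 2L),
  N     = c(237L, 27031L, 964L, 23987L)
)

# Second stage: sum the per-chunk counts so each (year, month) appears once
totals <- per_chunk[, .(N = sum(N)), by = .(year, month)]
```

In practice the first stage would be the flights.df[...] call shown earlier, and the second stage would run on the data.table it returns.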
All data.table syntax is supported. In addition, disk.frame can load only the columns required for an analysis through the keep = option. The analysis above needs only the year and month variables, hence keep = c("year", "month") was used.
Alternatively, the srckeep function can be used to achieve the same result.
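A hedged sketch of the srckeep approach follows. It assumes the disk.frame, data.table, and nycflights13 packages are installed, and it writes to a separate output directory (the name flights13_srckeep is hypothetical) so that it does not depend on the earlier chunks.

```r
library(disk.frame)
library(data.table)

# Two background workers, as in the non-interactive setup earlier
setup_disk.frame(2)

# Separate output directory (hypothetical name) keeps this sketch self-contained
flights.df <- as.disk.frame(
  nycflights13::flights,
  outdir = file.path(tempdir(), "flights13_srckeep"),
  overwrite = TRUE
)

# srckeep() restricts which columns are read from disk; the data.table
# expression is then applied to each chunk as before
res <- srckeep(flights.df, c("year", "month"))[, .N, .(year, month)]
```

As with keep =, the counts in res are still per chunk, so a year and month pair may appear more than once.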
disk.frame sends the computation to background workers, which are essentially distinct and separate R sessions. Typically, the variables available in your current R session are not visible in those other sessions, but disk.frame uses the future package's variable detection to work out which variables are in use and sends them to the background workers so that they can access them too. For example:
y = 42
some_fn <- function(x) x
flights.df[,some_fn(y)]
#> data.table syntax for disk.frame may be moved to a separate package in the future
#> [1] 42 42 42 42 42 42
In the above example, neither some_fn nor y is defined in the background workers' environments, but disk.frame still manages to evaluate the code flights.df[,some_fn(y)].
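The same behaviour can be observed with the future package on its own. The sketch below assumes future is installed and uses a multisession plan with two workers. It illustrates how globals are detected and exported and is not disk.frame's internal code.

```r
library(future)

# Start two background R sessions (an assumption for this sketch)
plan(multisession, workers = 2)

y <- 42
some_fn <- function(x) x

# future() inspects the expression, finds that it uses some_fn and y,
# and exports both to the worker before evaluating the expression there
f <- future(some_fn(y))
result <- value(f)

# Return to sequential evaluation, which shuts the workers down
plan(sequential)
```

Here result holds the value computed in the worker session, even though some_fn and y were only ever defined in the main session.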